Preface -- RRC MATH1020 adaptation -- Version 2015 Revision B

This text has been adapted specifically for MATH1020 at Red River 
College. It is intended for the one-semester introduction to statistics course 
for business students. It focuses on the interpretation of statistical results, 
especially in real-world settings, and assumes that students have an
understanding of intermediate algebra. In addition to end-of-section practice
and homework sets, examples of each topic are explained step-by-step 
throughout the text and followed by a Try It problem that is designed as 
extra practice for students. This book also includes collaborative exercises 
and statistics labs designed to give students the opportunity to work 
together and explore key concepts. While the book has been built so that 
each chapter builds on the previous, it can be rearranged to accommodate 
any instructor’s particular needs. 


About Introductory Statistics 


This text has been adapted specifically for MATH1020 at Red River
College. It is designed for the one-semester introduction to statistics
course geared toward business students. This text assumes students have
been exposed to intermediate algebra, and it focuses on the applications of
statistical knowledge rather than the theory behind it.


The foundation of this textbook is Introductory Statistics, by Barbara 
Illowsky and Susan Dean. Additional topics, examples, and ample 
opportunities for practice have been added to each chapter. The 
development choices for this textbook were made with the guidance of 
many faculty members who are deeply involved in teaching this course. 
These choices led to innovations in art, terminology, and practical 
applications, all with a goal of increasing relevance and accessibility for 
students. We strove to make the discipline meaningful, so that students can 
draw from it a working knowledge that will enrich their future studies and 
help them make sense of the world around them. 


Coverage and Scope 
Chapter 2 Descriptive Statistics
Chapter 4 Discrete Random Variables
Chapter 7 Confidence Intervals

Pedagogical Foundation and Features


• Examples are placed strategically throughout the text to show students
the step-by-step process of interpreting and solving statistical
problems. To keep the text relevant for students, the examples are
drawn from a broad spectrum of practical topics; these include
examples about college life and learning, health and medicine, retail
and business, and sports and entertainment.

• Practice, Homework, and Bringing It Together problems give the
students problems at various degrees of difficulty while also including
real-world scenarios to engage students.


Ancillaries 


• Elementary Business Statistics Course (janux.ou.edu)
Preface to OpenStax College's Introductory Statistics: Red
River Custom Edition


OpenStax College’s Introductory Statistics by senior contributing writers
Barbara Illowsky and Susan Dean is a complete text in itself, so the
creation of a custom edition requires some rationale for all the effort that
went into it.


This custom edition for Red River College builds from the University of
Oklahoma's adaptation of Introductory Statistics and maintains, for the most
part, the structure of the material. Only the order of the later chapters
on the chi-square distribution and the F distribution changes. The discrete
probability density functions have been reordered in a way that is felt to
provide a logical development of probability density functions from simple
counting formulas to more complex continuous distributions. What has
been preserved, and is a true foundation stone of both texts, is the set of
homework assignments and examples. Many additional homework
assignments have been added, and the new text includes new examples that
use a more mathematical approach, but the wealth of examples,
mostly with answers, is critical to student success and a keystone of this
custom edition of Introductory Statistics.


What differentiates this text from its foundation document grows out of a
difference in philosophy toward the use of mathematical formulas. The
significant and important work of the foundation text to help students
master the Texas Instruments calculator has been discarded. All required
calculations are within the capability of a $2.00 calculator, until regression,
correlation, and ANOVA, of course. It is my belief that students lose much if
they do not see the formulas in action and develop a “feel” for what they are
doing with the data. This requires additional material that helps students
understand the combinatorial formula and factorials, as well as the sigma
notation otherwise carried by the calculator. This difference in perspective
then changes the acceptance/rejection rule for hypothesis testing to
comparisons between calculated test statistics versus p-values. The
terminology of confidence intervals and the process of finding probabilities
also change, now relying on statistical tables.


Laying more emphasis on the development of the mathematical formulas 
requires a closer link to the fundamental theorem of inferential statistics, the 
Central Limit Theorem. This relationship is developed in the foundation 
text and given its proper critical role in statistical theory. This custom 
edition of Introductory Statistics repeats this link in each section for each
test statistic developed: tests for proportions, for differences in means, and
for differences in proportions.


This RRC custom edition Introductory Statistics owes much to the work of
Dr. Illowsky and Ms. Dean in OpenStax College’s Introductory Statistics,
and to its subsequent adaptation by Dr. Alexander Holmes and his team at
University of Oklahoma.


Introduction 

Figure: We encounter statistics in our daily lives more often than we
probably realize and from many different sources, like the news.
(credit: David Sim)


You are probably asking yourself the question, "When and where will I use
statistics?" If you read any newspaper, watch television, or use the Internet,
you will see statistical information. There are statistics about crime, sports, 
education, politics, and real estate. Typically, when you read a newspaper 
article or watch a television news program, you are given sample 
information. With this information, you may make a decision about the 
correctness of a statement, claim, or "fact." Statistical methods can help you 
make the "best educated guess." 


Since you will undoubtedly be given statistical information at some point in 
your life, you need to know some techniques for analyzing the information 
thoughtfully. Think about buying a house or managing a budget. Think 
about your chosen profession. The fields of economics, business, 
psychology, education, biology, law, computer science, police science, and 
early childhood development require at least one course in statistics. 


Included in this chapter are the basic ideas and words of probability and
statistics. You will soon understand that statistics and probability work
together. You will also learn how data are gathered and how "good" data
can be distinguished from "bad."


Definitions of Statistics, Probability, and Key Terms 


The science of statistics deals with the collection, analysis, interpretation, 
and presentation of data. We see and use data in our everyday lives. 


In this course, you will learn how to organize and summarize data. 
Organizing and summarizing data is called descriptive statistics. Two ways 
to summarize data are by graphing and by using numbers (for example, 
finding an average). After you have studied probability and probability 
distributions, you will use formal methods for drawing conclusions from 
"good" data. The formal methods are called inferential statistics. Statistical 
inference uses probability to determine how confident we can be that our 
conclusions are correct. 


Effective interpretation of data (inference) is based on good procedures for 
producing data and thoughtful examination of the data. You will encounter 
what will seem to be too many mathematical formulas for interpreting data. 
The goal of statistics is not to perform numerous calculations using the 
formulas, but to gain an understanding of your data. The calculations can be 
done using a calculator or a computer. The understanding must come from 
you. If you can thoroughly grasp the basics of statistics, you can be more 
confident in the decisions you make in life. 


Probability 


Probability is a mathematical tool used to study randomness. It deals with
the chance (the likelihood) of an event occurring. For example, if you toss a
fair coin four times, the outcomes may not be two heads and two tails.
However, if you toss the same coin 4,000 times, the outcomes will be close
to half heads and half tails. The expected theoretical probability of heads in
any one toss is 1/2 or 0.5. Even though the outcomes of a few repetitions are
uncertain, there is a regular pattern of outcomes when there are many
repetitions. After reading about the English statistician Karl Pearson, who
tossed a coin 24,000 times with a result of 12,012 heads, one of the authors
tossed a coin 2,000 times. The results were 996 heads. The fraction 996/2000
is equal to 0.498, which is very close to 0.5, the expected probability.
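
The coin-tossing pattern described above can be explored with a short
simulation. The sketch below is only an illustration, not part of the original
experiment: the function name proportion_of_heads, the seed, and the chosen
numbers of tosses are assumptions made for this example, and a computer's
pseudo-random generator stands in for a real coin.

```python
import random


def proportion_of_heads(num_tosses, seed=0):
    """Simulate fair-coin tosses and return the fraction that land heads."""
    rng = random.Random(seed)  # seeded so that a rerun gives the same tosses
    heads = sum(1 for _ in range(num_tosses) if rng.random() < 0.5)
    return heads / num_tosses


# A few tosses can land far from 0.5; many tosses tend to settle near it.
for n in (4, 40, 4000, 40000):
    print(n, proportion_of_heads(n))

# The result reported in the text: 996 heads in 2,000 tosses.
print(996 / 2000)
```

With only four tosses the printed proportion can be far from 0.5, while the
runs with thousands of tosses should land close to it.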


The theory of probability began with the study of games of chance such as 
poker. Predictions take the form of probabilities. To predict the likelihood 
of an earthquake, of rain, or whether you will get an A in this course, we 
use probabilities. Doctors use probability to determine the chance of a 
vaccination causing the disease the vaccination is supposed to prevent. A 
stockbroker uses probability to determine the rate of return on a client's 
investments. You might use probability to decide to buy a lottery ticket or 
not. In your study of statistics, you will use the power of mathematics 
through probability calculations to analyze and interpret your data. 


Key Terms 


In statistics, we generally want to study a population. You can think of a 
population as a collection of persons, things, or objects under study. To 
study the population, we select a sample. The idea of sampling is to select 
a portion (or subset) of the larger population and study that portion (the 
sample) to gain information about the population. Data are the result of 
sampling from a population. 


Because it takes a lot of time and money to examine an entire population, 
sampling is a very practical technique. If you wished to compute the overall 
grade point average at your school, it would make sense to select a sample 
of students who attend the school. The data collected from the sample 
would be the students' grade point averages. In presidential elections,
opinion poll samples of 1,000 to 2,000 people are taken. The opinion poll is
supposed to represent the views of the people in the entire country.
Manufacturers of canned carbonated drinks take samples to determine if a
16-ounce can contains 16 ounces of carbonated drink.


From the sample data, we can calculate a statistic. A statistic is a number 
that represents a property of the sample. For example, if we consider one 
math class to be a sample of the population of all math classes, then the 
average number of points earned by students in that one math class at the 
end of the term is an example of a statistic. The statistic is an estimate of a 
population parameter, in this case the mean. A parameter is a numerical 
characteristic of the whole population that can be estimated by a statistic. 
Since we considered all math classes to be the population, the average
number of points earned per student over all the math classes is an example
of a parameter.


One of the main concerns in the field of statistics is how accurately a
statistic estimates a parameter. The accuracy really depends on how well the
sample represents the population. The sample must contain the 
characteristics of the population in order to be a representative sample. We 
are interested in both the sample statistic and the population parameter in 
inferential statistics. In a later chapter, we will use the sample statistic to 
test the validity of the established population parameter. 
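
To make the distinction between a statistic and a parameter concrete, here is
a minimal sketch. Everything in it is hypothetical: the population of 5,000
point totals is generated at random purely for illustration, and the names
population, sample, and biased_sample are assumptions of this example. It
compares the mean of a simple random sample with the population mean, and
then shows how an unrepresentative sample (only the top scorers) gives a
misleading statistic.

```python
import random
import statistics

rng = random.Random(1)  # seeded so that a rerun gives the same numbers

# Hypothetical population: end-of-term points for 5,000 students.
population = [rng.randint(40, 100) for _ in range(5000)]
parameter = statistics.mean(population)  # population mean (the parameter)

# Simple random sample of 100 students, drawn without replacement.
sample = rng.sample(population, 100)
statistic = statistics.mean(sample)  # sample mean (the statistic)

# An unrepresentative sample: only the 100 highest scorers.
biased_sample = sorted(population)[-100:]
biased_statistic = statistics.mean(biased_sample)

print(f"parameter (population mean):   {parameter:.2f}")
print(f"statistic (random sample):     {statistic:.2f}")
print(f"statistic (top scorers only):  {biased_statistic:.2f}")
```

The random sample's mean should fall near the population mean, while the mean
of the top scorers overstates it.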


A variable, or random variable, usually notated by capital letters such as X 
and Y, is a characteristic or measurement that can be determined for each 
member of a population. Variables may be numerical or categorical. 
Numerical variables take on values with equal units such as weight in 
pounds and time in hours. Categorical variables place the person or thing 
into a category. If we let X equal the number of points earned by one math
student at the end of a term, then X is a numerical variable. If we let Y be a
person's party affiliation, then some examples of Y include Republican,
Democrat, and Independent. Y is a categorical variable. We could do some 
math with values of X (calculate the average number of points earned, for 
example), but it makes no sense to do math with values of Y (calculating an 
average party affiliation makes no sense). 
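
The difference between the two kinds of variables shows up in what can
sensibly be computed from them. In the sketch below, the values of X and Y
are made up for illustration: the numerical values are averaged, while the
categorical values are counted by category instead.

```python
import statistics
from collections import Counter

# Hypothetical values, made up for this illustration.
x_points = [78, 85, 92, 64, 85]  # X: points earned (numerical)
y_party = ["Republican", "Democrat", "Independent",
           "Democrat", "Democrat"]  # Y: party affiliation (categorical)

# Averaging the numerical variable is meaningful.
print(statistics.mean(x_points))

# For the categorical variable, count how often each category occurs.
print(Counter(y_party))
print(statistics.mode(y_party))  # the most common category
```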


Data are the actual values of the variable. They may be numbers or they 
may be words. Datum is a single value. 


Two words that come up often in statistics are mean and proportion. If you
were to take three exams in your math classes and obtain scores of 86, 75,
and 92, you would calculate your mean score by adding the three exam
scores and dividing by three (your mean score would be 84.3 to one
decimal place). If, in your math class, there are 40 students and 22 are men
and 18 are women, then the proportion of men students is 22/40 and the
proportion of women students is 18/40. Mean and proportion are discussed in
more detail in later chapters.
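
The arithmetic in the paragraph above can be checked with a few lines of
code. This is a sketch that uses only the numbers given in the text; the
variable names are chosen for this example.

```python
from fractions import Fraction

# Mean of the three exam scores from the text.
scores = [86, 75, 92]
mean_score = sum(scores) / len(scores)
print(round(mean_score, 1))  # 84.3

# Proportions of men and women in the class of 40 students.
men, women = 22, 18
total = men + women
print(Fraction(men, total), men / total)      # 11/20 0.55
print(Fraction(women, total), women / total)  # 9/20 0.45
```

The fractions 22/40 and 18/40 reduce to 11/20 and 9/20, and the two
proportions add up to 1.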


Note: 

NOTE 

The words "mean" and "average" are often used interchangeably. The 
substitution of one word for the other is common practice. The technical 
term is "arithmetic mean," and "average" is technically a center location. 
However, in practice among non-statisticians, "average" is commonly 
accepted for "arithmetic mean." 


Example: 
Exercise: 


Problem: 


Determine what the key terms refer to in the following study. We want 
to know the average (mean) amount of money first year college 
students spend at ABC College on school supplies that do not include 
books. We randomly surveyed 100 first year students at the college. 
Three of those students spent $150, $200, and $225, respectively. 


Solution: 


The population is all first year students attending ABC College this 
term. 


The sample could be all students enrolled in one section of a 
beginning statistics course at ABC College (although this sample may 
not represent the entire population). 


The parameter is the average (mean) amount of money spent 
(excluding books) by first year college students at ABC College this 
term: the population mean. 


The statistic is the average (mean) amount of money spent (excluding 
books) by first year college students in the sample. 


The variable could be the amount of money spent (excluding books) 
by one first year student. Let X = the amount of money spent 
(excluding books) by one first year student attending ABC College. 


The data are the dollar amounts spent by the first year students. 
Examples of the data are $150, $200, and $225. 


Note: 
Try It 
Exercise: 


Problem: 


Determine what the key terms refer to in the following study. We want 
to know the average (mean) amount of money spent on school 
uniforms each year by families with children at Knoll Academy. We 
randomly survey 100 families with children in the school. Three of 
the families spent $65, $75, and $95, respectively. 


Solution: 
Try It Solutions 


The population is all families with children attending Knoll 
Academy. 


The sample is a random selection of 100 families with children 
attending Knoll Academy. 


The parameter is the average (mean) amount of money spent on 
school uniforms by families with children at Knoll Academy. 


The statistic is the average (mean) amount of money spent on school 
uniforms by families in the sample. 


The variable is the amount of money spent by one family. Let X = the
amount of money spent on school uniforms by one family with
children attending Knoll Academy.


The data are the dollar amounts spent by the families. Examples of 
the data are $65, $75, and $95. 


Example: 
Exercise: 


Problem: 
Determine what the key terms refer to in the following study. 


A study was conducted at a local college to analyze the average
cumulative GPAs of students who graduated last year. Fill in the letter
of the phrase that best describes each of the items below. 


1. Population 2. Statistic 3. Parameter 4. Sample 
5. Variable 6. Data 


a. all students who attended the college last year 

b. the cumulative GPA of one student who graduated from the 
college last year 

c. 3.65, 2.80, 1.50, 3.90

d. a group of students who graduated from the college last year, 
randomly selected 

e. the average cumulative GPA of students who graduated from the 
college last year 

f. all students who graduated from the college last year 

g. the average cumulative GPA of students in the study who 
graduated from the college last year 


Solution: 


1. f; 2. g; 3. e; 4. d; 5. b; 6. c


Example: 
Exercise: 


Problem: 
Determine what the key terms refer to in the following study. 


As part of a study designed to test the safety of automobiles, the 
National Transportation Safety Board collected and reviewed data 
about the effects of an automobile crash on test dummies. Here is the 
criterion they used: 


Speed at which cars crashed    Location of “drive” (i.e. dummies)
35 miles/hour                  Front Seat


Cars with dummies in the front seats were crashed into a wall at a 
speed of 35 miles per hour. We want to know the proportion of 
dummies in the driver’s seat that would have had head injuries, if they 
had been actual drivers. We start with a simple random sample of 75 
cars. 


Solution: 
The population is all cars containing dummies in the front seat. 
The sample is the 75 cars, selected by a simple random sample. 


The parameter is the proportion of driver dummies (if they had been 
real people) who would have suffered head injuries in the population. 


The statistic is the proportion of driver dummies (if they had been real
people) who would have suffered head injuries in the sample. 


The variable X = the number of driver dummies (if they had been real 
people) who would have suffered head injuries. 


The data are either: yes, had head injury, or no, did not. 


Example: 
Exercise: 


Problem: 
Determine what the key terms refer to in the following study. 


An insurance company would like to determine the proportion of all 
medical doctors who have been involved in one or more malpractice 
lawsuits. The company selects 500 doctors at random from a 
professional directory and determines the number in the sample who 
have been involved in a malpractice lawsuit. 


Solution: 


The population is all medical doctors listed in the professional 
directory. 


The parameter is the proportion of medical doctors who have been 
involved in one or more malpractice suits in the population. 


The sample is the 500 doctors selected at random from the 
professional directory. 


The statistic is the proportion of medical doctors who have been 
involved in one or more malpractice suits in the sample. 


The variable X = the number of medical doctors who have been 
involved in one or more malpractice suits. 


The data are either: yes, was involved in one or more malpractice 
lawsuits, or no, was not. 


References 


The Data and Story Library, 
http://lib.stat.cmu.edu/DASL/Stories/CrashTestDummies.html (accessed 
May 1, 2013). 


Chapter Review 


The mathematical theory of statistics is easier to learn when you know the 
language. This module presents important terms that will be used 
throughout the text. 


HOMEWORK 


For each of the following eight exercises, identify: a. the population, b. the 
sample, c. the parameter, d. the statistic, e. the variable, and f. the data. 
Give examples where appropriate. 

Exercise: 


Problem: 
A fitness center is interested in the mean amount of time a client 
exercises in the center each week. 
Exercise: 
Problem: 
Ski resorts are interested in the mean age at which children take their first
ski and snowboard lessons. They need this information to plan their ski
classes optimally.


Solution: 


a. all children who take ski or snowboard lessons 

b. a group of these children 

c. the population mean age of children who take their first 
snowboard lesson 

d. the sample mean age of children who take their first snowboard 
lesson 

e. X = the age of one child who takes his or her first ski or
snowboard lesson 

f. values for X, such as 3, 7, and so on 


Exercise: 
Problem: 
A cardiologist is interested in the mean recovery period of her patients 
who have had heart attacks. 
Exercise: 
Problem: 
Insurance companies are interested in the mean health costs each year
of their clients, so that they can determine the costs of health
insurance.


Solution: 


a. the clients of the insurance companies 

b. a group of the clients 

c. the mean health costs of the clients 

d. the mean health costs of the sample 

e. X = the health costs of one client 

f. values for X, such as 34, 9, 82, and so on 


Exercise: 


Problem: 
A politician is interested in the proportion of voters in his district who 
think he is doing a good job. 
Exercise: 
Problem: 


A marriage counselor is interested in the proportion of clients she 
counsels who stay married. 


Solution: 


a. all the clients of this counselor 

b. a group of clients of this marriage counselor 

c. the proportion of all her clients who stay married 

d. the proportion of the sample of the counselor’s clients who stay 
married 

e. X = the number of couples who stay married

f. yes, no 


Exercise: 
Problem: 
Political pollsters may be interested in the proportion of people who 
will vote for a particular cause. 
Exercise: 
Problem: 


A marketing company is interested in the proportion of people who 
will buy a particular product. 


Solution: 


a. all people (maybe in a certain geographic area, such as the United 
States) 


b. a group of the people 

c. the proportion of all people who will buy the product 
d. the proportion of the sample who will buy the product 
e. X = the number of people who will buy it 

f. buy, not buy 


Use the following information to answer the next three exercises: A Lake 
Tahoe Community College instructor is interested in the mean number of 
days Lake Tahoe Community College math students are absent from class 
during a quarter. 

Exercise: 


Problem: What is the population she is interested in? 


a. all Lake Tahoe Community College students 

b. all Lake Tahoe Community College English students 

c. all Lake Tahoe Community College students in her classes 
d. all Lake Tahoe Community College math students 


Exercise: 


Problem: Consider the following: 


X = number of days a Lake Tahoe Community College math student is 
absent 


In this case, X is an example of a: 
a. variable. 
b. population. 


c. statistic.
d. data. 


Solution: 


a 
Exercise: 


Problem: 


The instructor’s sample produces a mean number of days absent of 3.5 
days. This value is an example of a: 


a. parameter. 
b. data. 

c. statistic.
d. variable. 


Glossary 


Average 
also called mean or arithmetic mean; a number that describes the 
central tendency of the data 


Categorical Variable 
variables that take on values that are names or labels 


Data 
a set of observations (a set of possible outcomes); most data can be put 
into two groups: qualitative (an attribute whose value is indicated by a 
label) or quantitative (an attribute whose value is indicated by a 
number). Quantitative data can be separated into two subgroups: 
discrete and continuous. Data is discrete if it is the result of counting 
(such as the number of students of a given ethnic group in a class or 
the number of books on a shelf). Data is continuous if it is the result of 
measuring (such as distance traveled or weight of luggage) 


Mathematical Models 
a description of a phenomenon using mathematical concepts, such as 
equations, inequalities, distributions, etc. 


Numerical Variable 
variables that take on values that are indicated by numbers 


Observational Study 
a study in which the independent variable is not manipulated by the 
researcher 


Parameter 
a number that is used to represent a population characteristic and that 
generally cannot be determined easily 


Population 
all individuals, objects, or measurements whose properties are being 
studied 


Probability 
a number between zero and one, inclusive, that gives the likelihood 
that a specific event will occur 


Proportion 
the number of successes divided by the total number in the sample 


Representative Sample 
a subset of the population that has the same characteristics as the 
population 


Sample 
a subset of the population studied 


Statistic 
a numerical characteristic of the sample; a statistic estimates the 
corresponding population parameter. 


Statistical Models 
a description of a phenomenon using probability distributions that 
describe the expected behavior of the phenomenon and the variability 
in the expected observations. 


Survey 
a study in which data is collected as reported by individuals. 


Variable 
a characteristic of interest for each person or object in a population 


Data, Sampling, and Variation in Data and Sampling 


Data may come from a population or from a sample. Lowercase letters like x 
or y generally are used to represent data values. Most data can be put into the 
following categories: 


• Qualitative
• Quantitative


Qualitative data are the result of categorizing or describing attributes of a
population. Qualitative data are also often called categorical data. Hair color,
blood type, ethnic group, the car a person drives, and the street a person lives
on are examples of qualitative (categorical) data. Qualitative (categorical) data
are generally described by words or letters. For instance, hair color might be
black, dark brown, light brown, blonde, gray, or red. Blood type might be
AB+, O-, or B+. Researchers often prefer to use quantitative data over
qualitative (categorical) data because they lend themselves more easily to
mathematical analysis. For example, it does not make sense to find an average
hair color or blood type.


Quantitative data are always numbers. Quantitative data are the result of 
counting or measuring attributes of a population. Amount of money, pulse 
rate, weight, number of people living in your town, and number of students 
who take statistics are examples of quantitative data. Quantitative data may be 
either discrete or continuous. 


All data that are the result of counting are called quantitative discrete data. 
These data take on only certain numerical values. If you count the number of 
phone calls you receive for each day of the week, you might get values such as 
zero, one, two, or three. 


Data that are not only made up of counting numbers, but that may include 
fractions, decimals, or irrational numbers, are called quantitative continuous 
data. Continuous data are often the results of measurements like lengths, 
weights, or times. A list of the lengths in minutes for all the phone calls that 
you make in a week, with numbers like 2.4, 7.5, or 11.0, would be quantitative 
continuous data. 


Example: 

Data Sample of Quantitative Discrete Data 

The data are the number of books students carry in their backpacks. You 
sample five students. Two students carry three books, one student carries four 
books, one student carries two books, and one student carries one book. The 
numbers of books (three, four, two, and one) are the quantitative discrete data. 


Note: 
Try It 
Exercise: 


Problem: 


The data are the number of machines in a gym. You sample five gyms. 
One gym has 12 machines, one gym has 15 machines, one gym has ten 
machines, one gym has 22 machines, and the other gym has 20 
machines. What type of data is this? 


Solution: 
Try It Solutions 


quantitative discrete data 


Example: 

Data Sample of Quantitative Continuous Data 

The data are the weights of backpacks with books in them. You sample the 
same five students. The weights (in pounds) of their backpacks are 6.2, 7, 6.8, 
9.1, 4.3. Notice that backpacks carrying three books can have different 
weights. Weights are quantitative continuous data. 


Note: 
Try It 
Exercise: 


Problem: 


The data are the areas of lawns in square feet. You sample five houses. 
The areas of the lawns are 144 sq. feet, 160 sq. feet, 190 sq. feet, 180 sq. 
feet, and 210 sq. feet. What type of data is this? 


Solution: 
Try It Solutions 


quantitative continuous data 


Example: 

You go to the supermarket and purchase three cans of soup (19 ounces
tomato bisque, 14.1 ounces lentil, and 19 ounces Italian wedding), two
packages of nuts (walnuts and peanuts), four different kinds of vegetables
(broccoli, cauliflower, spinach, and carrots), and two desserts (16 ounces
pistachio ice cream and 32 ounces chocolate chip cookies).

Exercise: 


Problem: 


Name data sets that are quantitative discrete, quantitative continuous,
and qualitative (categorical).


Solution: 
One Possible Solution: 


• The three cans of soup, two packages of nuts, four kinds of
vegetables and two desserts are quantitative discrete data because
you count them.

• The weights of the soups (19 ounces, 14.1 ounces, 19 ounces) are
quantitative continuous data because you measure weights as
precisely as possible.

• Types of soups, nuts, vegetables and desserts are
qualitative (categorical) data because they are categorical.


Try to identify additional data sets in this example. 


Example: 

The data are the colors of backpacks. Again, you sample the same five 
students. One student has a red backpack, two students have black backpacks, 
one student has a green backpack, and one student has a gray backpack. The 
colors red, black, black, green, and gray are qualitative (categorical) data.


Note: 
Try It 
Exercise: 


Problem: 


The data are the colors of houses. You sample five houses. The colors of 
the houses are white, yellow, white, red, and white. What type of data is 
this? 


Solution: 
Try It Solutions 


qualitative (categorical) data


Note: 

Note 

You may collect data as numbers and report it categorically. For example, the 
quiz scores for each student are recorded throughout the term. At the end of 
the term, the quiz scores are reported as A, B, C, D, or F. 
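
As a sketch of how numerical data can be reported categorically, the function
below turns quiz scores into letter grades. The cutoffs (90, 80, 70, 60), the
function name letter_grade, and the sample scores are assumptions made for
this example; the text does not specify a grading scale.

```python
def letter_grade(score):
    """Map a numeric quiz score (0 to 100) to a letter category.

    The cutoffs are an assumption made for this sketch, not a scale
    given in the text.
    """
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"


# Hypothetical quiz scores, recorded as numbers (quantitative data).
quiz_scores = [93.5, 81, 77.25, 64, 52]

# The same scores reported as categories (qualitative data).
print([letter_grade(s) for s in quiz_scores])  # ['A', 'B', 'C', 'D', 'F']
```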


Example: 
Exercise: 


Problem: 


Work collaboratively to determine the correct data type (quantitative or 
qualitative). Indicate whether quantitative data are continuous or 
discrete. Hint: Data that are discrete often start with the words "the 
number of." 


a. the number of pairs of shoes you own 

b. the type of car you drive 

c. the distance from your home to the nearest grocery store 
d. the number of classes you take per school year 

e. the type of calculator you use 

f. weights of sumo wrestlers 

g. number of correct answers on a quiz 

h. IQ scores (This may cause some discussion.) 


Solution: 


Items a, d, and g are quantitative discrete; items c, f, and h are 
quantitative continuous; items b and e are qualitative, or categorical. 


Note: 
Try It 
Exercise: 


Problem: 


Determine the correct data type (quantitative or qualitative) for the 
number of cars in a parking lot. Indicate whether quantitative data are 
continuous or discrete. 


Solution: 
Try It Solutions 


quantitative discrete 


Example: 
Exercise: 


Problem: 


A statistics professor collects information about the classification of her 
students as freshmen, sophomores, juniors, or seniors. The data she 
collects are summarized in the pie chart [link]. What type of data does 
this graph show? 

[Figure: pie chart titled "Classification of Statistics Students" with 
wedges for Freshman, Sophomore, Junior, and Senior] 


Solution: 


This pie chart shows the students in each year, which is qualitative (or 
categorical) data. 


Note: 
Try It 
Exercise: 


Problem: 


The registrar at State University keeps records of the number of credit 
hours students complete each semester. The data he collects are 
summarized in the histogram. The class boundaries are 10 to less than 
13, 13 to less than 16, 16 to less than 19, 19 to less than 22, and 22 to 
less than 25. 


[Figure: histogram titled "Number of Credit Hours Completed per Student"; 
horizontal axis: credit hours completed (10, 13, 16, 19, 22, 25); vertical 
axis: number of students] 


What type of data does this graph show? 


Solution: 
Try It Solutions 


A histogram is used to display quantitative data: the numbers of credit 
hours completed. Because students can complete only a whole number 
of hours (no fractions of hours allowed), these data are quantitative 
discrete. 


Qualitative Data Discussion 


Below are tables comparing the number of part-time and full-time students at 
De Anza College and Foothill College enrolled for the spring 2010 quarter. 
The tables display counts (frequencies) and percentages or proportions 
(relative frequencies). The percent columns make comparing the same 
categories in the colleges easier. Displaying percentages along with the 
numbers is often helpful, but it is particularly important when comparing sets 
of data that do not have the same totals, such as the total enrollments for both 
colleges in this example. Notice how much larger the percentage for part-time 
students at Foothill College is compared to De Anza College. 


             De Anza College          Foothill College 

             Number     Percent       Number     Percent 
Full-time     9,200      40.9%         4,059      28.6% 
Part-time    13,296      59.1%        10,124      71.4% 
Total        22,496       100%        14,183       100% 


Fall Term 2007 (Census day) 
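
The percent columns can be reproduced from the counts. The following is a minimal sketch using the enrollment counts in the table; the function name `relative_frequencies` is simply a label chosen for this illustration.

```python
# Enrollment counts taken from the table above.
enrollment = {
    "De Anza College": {"Full-time": 9200, "Part-time": 13296},
    "Foothill College": {"Full-time": 4059, "Part-time": 10124},
}


def relative_frequencies(counts):
    """Convert category counts to percentages of the total, to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


percents = {college: relative_frequencies(c) for college, c in enrollment.items()}
for college, p in percents.items():
    print(college, p)
```

Because the two colleges have different totals, the percentages (relative frequencies) are what make the part-time shares directly comparable.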


Tables are a good way of organizing and displaying data. But graphs can be 
even more helpful in understanding the data. There are no strict rules 
concerning which graphs to use. Two graphs that are used to display 
qualitative (categorical) data are pie charts and bar graphs. 


In a pie chart, categories of data are represented by wedges in a circle and are 
proportional in size to the percent of individuals in each category. 


In a bar graph, the length of the bar for each category is proportional to the 
number or percent of individuals in each category. Bars may be vertical or 
horizontal. 


A Pareto chart consists of bars that are sorted into order by category size 
(largest to smallest). 


Look at [link] and [link] and determine which graph (pie or bar) you think 
displays the comparisons better. 


It is a good idea to look at a variety of graphs to see which is the most helpful 
in displaying the data. We might make different choices of what we think is 
the “best” graph depending on the data and the context. Our choice also 
depends on what we are using the data for. 


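[Figure: two pie charts, one for De Anza College and one for Foothill 
College, each showing the part-time and full-time shares] 

[Figure: bar graph titled "Student Status" comparing the numbers of 
full-time and part-time students at De Anza and Foothill] 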


Percentages That Add to More (or Less) Than 100% 


Sometimes percentages add up to be more than 100% (or less than 100%). In 
the graph, the percentages add to more than 100% because students can be in 
more than one category. A bar graph is appropriate to compare the relative 
size of the categories. A pie chart cannot be used. It also could not be used if 
the percentages added to less than 100%. 


Characteristic/category                          Percent 

Full-time students                                40.9% 
Students who intend to transfer to a 4-year 
educational institution                           48.6% 
Students under age 25                             61.0% 
TOTAL                                            150.5% 


De Anza College Spring 2010 
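
One quick check before drawing a pie chart is whether the percentages add to 100%. The sketch below applies that check to the table; the 0.5-point tolerance for rounding is an assumption, and the check is only a necessary condition, since the categories must also not overlap.

```python
# Percentages from the table above; a student can fall into several categories.
# The transfer figure (48.6) is the value implied by the 150.5% total.
percents = {
    "Full-time students": 40.9,
    "Students who intend to transfer": 48.6,
    "Students under age 25": 61.0,
}


def adds_to_whole(values, tolerance=0.5):
    """True when the percentages add to 100%, allowing for rounding."""
    return abs(sum(values) - 100.0) <= tolerance


total = round(sum(percents.values()), 1)
print(total)
print(adds_to_whole(percents.values()))
```

Here the check fails because the total is well over 100%, so a bar graph is the appropriate display.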


[Figure: bar graph with a vertical axis from 0% to 100% and bars for 
under age 25 (61.0%), intend to transfer, full-time, and all students 
(100.0%)] 


Omitting Categories/Missing Data 


The table displays Ethnicity of Students but is missing the "Other/Unknown" 
category. This category contains people who did not feel they fit into any of 
the ethnicity categories or declined to respond. Notice that the frequencies do 
not add up to the total number of students. In this situation, create a bar graph 
and not a pie chart. 


                    Frequency                  Percent 

Asian               8,794                      36.1% 
Black               1,412                      5.8% 
Filipino            1,298                      5.3% 
Hispanic            4,180                      17.1% 
Native American     146                        0.6% 
Pacific Islander    236                        1.0% 
White               5,978                      24.5% 
TOTAL               22,044 out of 24,382       90.4% out of 100% 


Ethnicity of Students at De Anza College Fall Term 2007 (Census Day) 
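
The size of the omitted "Other/Unknown" category can be worked out from the table. This short sketch subtracts the listed frequencies from the total number of students; the variable names are chosen only for illustration.

```python
# Frequencies from the table above (De Anza College, Fall Term 2007).
frequencies = {
    "Asian": 8794,
    "Black": 1412,
    "Filipino": 1298,
    "Hispanic": 4180,
    "Native American": 146,
    "Pacific Islander": 236,
    "White": 5978,
}
total_students = 24382

listed = sum(frequencies.values())            # students in a listed category
other_unknown = total_students - listed       # students missing from the table
other_percent = round(100 * other_unknown / total_students, 1)

print(listed, other_unknown, other_percent)
```

The listed frequencies account for 22,044 students, which leaves 2,338 students (about 9.6%) in the omitted category.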


[Figure: bar graph titled "Ethnicity of Students"; vertical axis from 0.0% 
to 40.0%; bars for Asian (36.1%), Black (5.8%), Filipino (5.3%), Hispanic, 
Native American (0.6%), Pacific Islander (1.0%), and White] 


The following graph is the same as the previous graph but the 
“Other/Unknown” percent (9.6%) has been included. The “Other/Unknown” 
category is large compared to some of the other categories (Native American, 
0.6%, Pacific Islander 1.0%). This is important to know when we think about 
what the data are telling us. 


This particular bar graph in [link] can be difficult to understand visually. The 
graph in [link] is a Pareto chart. The Pareto chart has the bars sorted from 
largest to smallest and is easier to read and interpret. 
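
The ordering used by a Pareto chart is just a sort from largest to smallest. The sketch below sorts the ethnicity percentages, including the 9.6% "Other/Unknown" category, and prints a rough text version of the bars; the text bars are a display choice for this illustration, not a substitute for a real graph.

```python
# Percentages from the ethnicity table, plus the Other/Unknown category.
percents = {
    "Asian": 36.1,
    "Black": 5.8,
    "Filipino": 5.3,
    "Hispanic": 17.1,
    "Native American": 0.6,
    "Pacific Islander": 1.0,
    "White": 24.5,
    "Other/Unknown": 9.6,
}

# A Pareto chart orders the bars from the largest category to the smallest.
pareto_order = sorted(percents.items(), key=lambda item: item[1], reverse=True)

for category, percent in pareto_order:
    # Crude text bar: roughly one '#' per two percentage points.
    print(f"{category:<17}{percent:>5.1f}%  " + "#" * round(percent / 2))
```

Sorted this way, "Other/Unknown" lands in fourth place, ahead of four of the named categories.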

Bar Graph with Other/Unknown Category 

[Figure: bar graph titled "Ethnicity of Students"; vertical axis from 0.0% 
to 40.0%; bars for Asian, Black, Filipino, Hispanic, Native American, 
Pacific Islander, White, and Other/Unknown] 


Pareto Chart With Bars Sorted by Size 


[Figure: Pareto chart titled "Ethnicity of Students"; vertical axis from 
0.0% to 40.0%; bars in order of size: Asian, White, Hispanic, 
Other/Unknown, Black, Filipino, Pacific Islander, Native American] 


Pie Charts: No Missing Data 


The following pie charts have the “Other/Unknown” category included (since 
the percentages must add to 100%). The chart in [link] is organized by the size 
of each wedge, which makes it a more visually informative graph than the 
unsorted, alphabetical graph in [link]. 


[Figure: two pie charts titled "Ethnicity of Students", both including the 
Other/Unknown category (9.6%): one with wedges in alphabetical order and 
one with wedges ordered by size] 


Sampling 


Gathering information about an entire population often costs too much or is 
virtually impossible. Instead, we use a sample of the population. A sample 
should have the same characteristics as the population it is representing. 
Most statisticians use various methods of random sampling in an attempt to 
achieve this goal. This section describes a few of the most common 
methods. In each form of random sampling, each member of a population 
initially has an equal chance of being selected for the sample. Each method 
has pros and cons. The easiest method to describe is called a simple random 
sample. Any group of n individuals is equally likely to be chosen as any 
other group of n individuals if the simple random sampling technique is 
used. In other words, each sample of the same size has an equal chance of 
being selected. 
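
A simple random sample can be drawn with a random number generator. The following is a minimal sketch that assumes a hypothetical population numbered 1 to 10,000 and a sample size of ten; the seed value is arbitrary and only makes the run repeatable.

```python
import random

# Hypothetical population: ID numbers 1 through 10,000.
population = list(range(1, 10001))

rng = random.Random(2015)  # arbitrary seed so the run is repeatable
n = 10                     # assumed sample size

# sample() draws n distinct members; every group of n is equally likely.
simple_random_sample = rng.sample(population, n)
print(simple_random_sample)
```

Changing the seed gives a different group of ten, and each such group had the same chance of being selected.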


Besides simple random sampling, there are other forms of sampling that 
involve a chance process for getting the sample. Other well-known random 
sampling methods are the stratified sample, the cluster sample, and the 
systematic sample. 


To choose a stratified sample, divide the population into groups called strata 
and then take a proportionate number from each stratum. For example, you 
could stratify (group) your college population by department and then choose 
a proportionate simple random sample from each stratum (each department) to 
get a stratified random sample. To choose a simple random sample from each 
department, number each member of the first department, number each 
member of the second department, and do the same for the remaining 
departments. Then use simple random sampling to choose proportionate 
numbers from the first department and do the same for each of the remaining 
departments. Those numbers picked from the first department, picked from the 
second department, and so on represent the members who make up the 
stratified sample. 
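
The stratified procedure can be sketched as follows. The three departments, their sizes, and the 10% sampling rate are all made up for this illustration.

```python
import random

rng = random.Random(1)  # arbitrary seed for repeatability

# Hypothetical strata: department name -> list of member labels.
departments = {
    "Accounting": [f"A{i}" for i in range(1, 201)],  # 200 members
    "Marketing": [f"M{i}" for i in range(1, 101)],   # 100 members
    "Economics": [f"E{i}" for i in range(1, 51)],    # 50 members
}
sampling_fraction = 0.10  # assumed rate: take 10% of every stratum

stratified_sample = {}
for name, members in departments.items():
    k = round(sampling_fraction * len(members))       # proportionate share
    stratified_sample[name] = rng.sample(members, k)  # simple random sample

for name, chosen in stratified_sample.items():
    print(name, len(chosen), chosen[:3])
```

Every department is represented, and larger departments contribute proportionately more members to the sample.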


To choose a cluster sample, divide the population into clusters (groups) and 
then randomly select some of the clusters. All the members from these clusters 
are in the cluster sample. For example, if you randomly sample four 
departments from your college population, the four departments make up the 
cluster sample. Divide your college faculty by department. The departments 
are the clusters. Number each department, and then choose four different 
numbers using simple random sampling. All members of the four departments 
with those numbers are the cluster sample. 
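
For comparison, here is a minimal sketch of cluster sampling. It assumes ten hypothetical departments with five members each; only the choice of departments is random.

```python
import random

rng = random.Random(7)  # arbitrary seed for repeatability

# Hypothetical clusters: ten departments with five members each.
clusters = {
    f"Department {d}": [f"D{d}-member{m}" for m in range(1, 6)]
    for d in range(1, 11)
}

# Randomly choose four clusters, then keep every member of those clusters.
chosen_departments = rng.sample(sorted(clusters), 4)
cluster_sample = [member for d in chosen_departments for member in clusters[d]]

print(chosen_departments)
print(len(cluster_sample))
```

Unlike a stratified sample, six of the ten departments contribute nobody, while the four chosen departments contribute everyone.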


To choose a systematic sample, randomly select a starting point and take 
every nth piece of data from a listing of the population. For example, suppose 
you have to do a phone survey. Your phone book contains 20,000 residence 
listings. You must choose 400 names for the sample. Number the population 
1 to 20,000 and then use a simple random sample to pick a number that 
represents the first name in the sample. Then choose every fiftieth name 
thereafter until you have a total of 400 names (you might have to go back to 
the beginning of your phone list). Systematic sampling is frequently chosen 
because it is a simple method. 
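
The phone book example can be sketched in a few lines. The listing numbers stand in for names, and the seed is arbitrary; the wrap-around step corresponds to going back to the beginning of the phone list.

```python
import random

rng = random.Random(50)  # arbitrary seed for repeatability

population_size = 20_000                # residence listings, numbered 1 to 20,000
sample_size = 400
step = population_size // sample_size   # 20,000 / 400 = every 50th listing

start = rng.randint(1, population_size)  # random starting listing

# Take every 50th listing, wrapping around to the top of the list if needed.
systematic_sample = [
    (start - 1 + i * step) % population_size + 1 for i in range(sample_size)
]

print(step, start, systematic_sample[:5])
```

Only the starting point is random; once it is chosen, the rest of the sample is determined by the step.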


A type of sampling that is non-random is convenience sampling. Convenience 
sampling involves using results that are readily available. For example, a 
computer software store conducts a marketing study by interviewing potential 
customers who happen to be in the store browsing through the available 
software. The results of convenience sampling may be very good in some 
cases and highly biased (favor certain outcomes) in others. 


Sampling data should be done very carefully. Collecting data carelessly can 
have devastating results. Surveys mailed to households and then returned may 
be very biased (they may favor a certain group). It is better for the person 
conducting the survey to select the sample respondents. 


True random sampling is done with replacement. That is, once a member is 
picked, that member goes back into the population and thus may be chosen 
more than once. However, for practical reasons, in most populations, simple 
random sampling is done without replacement. Surveys are typically done 
without replacement. That is, a member of the population may be chosen only 
once. Most samples are taken from large populations and the sample tends to 
be small in comparison to the population. Since this is the case, sampling 
without replacement is approximately the same as sampling with replacement 
because the chance of picking the same individual more than once with 
replacement is very low. 


In a college population of 10,000 people, suppose you want to pick a sample 
of 1,000 randomly for a survey. For any particular sample of 1,000, if you 
are sampling with replacement, 


• the chance of picking the first person is 1,000 out of 10,000 (0.1000); 

• the chance of picking a different second person for this sample is 999 out 
of 10,000 (0.0999); 

• the chance of picking the same person again is 1 out of 10,000 (very 
low). 


If you are sampling without replacement, 


• the chance of picking the first person for any particular sample is 1,000 
out of 10,000 (0.1000); 

• the chance of picking a different second person is 999 out of 9,999 
(0.0999); 

• you do not replace the first person before picking the next person. 


Compare the fractions 999/10,000 and 999/9,999. For accuracy, carry the 
decimal answers to four decimal places. To four decimal places, these 
numbers are equivalent (0.0999). 


Sampling without replacement instead of sampling with replacement becomes 
a mathematical issue only when the population is small. For example, if the 
population is 25 people, the sample is ten, and you are sampling with 
replacement for any particular sample, then the chance of picking the first 
person is ten out of 25, and the chance of picking a different second person is 
nine out of 25 (you replace the first person). 


If you sample without replacement, then the chance of picking the first 
person is ten out of 25, and then the chance of picking the second person (who 
is different) is nine out of 24 (you do not replace the first person). 


Compare the fractions 9/25 and 9/24. To four decimal places, 9/25 = 0.3600 
and 9/24 = 0.3750. To four decimal places, these numbers are not equivalent. 
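
Both comparisons can be checked with a few lines of arithmetic. In this sketch the function name `second_pick_chances` is only a label for the calculation described above.

```python
def second_pick_chances(population, sample):
    """Chance that the second pick is a different member of a given sample.

    Returns (with replacement, without replacement), to four decimal places.
    """
    with_replacement = (sample - 1) / population
    without_replacement = (sample - 1) / (population - 1)
    return round(with_replacement, 4), round(without_replacement, 4)


large = second_pick_chances(10_000, 1_000)  # 999/10,000 versus 999/9,999
small = second_pick_chances(25, 10)         # 9/25 versus 9/24
print(large)
print(small)
```

For the large population the two values agree to four decimal places; for the population of 25 they differ by 0.015.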


When you analyze data, it is important to be aware of sampling errors and 
nonsampling errors. The actual process of sampling causes sampling errors. 
For example, the sample may not be large enough. Factors not related to the 
sampling process cause nonsampling errors. A defective counting device can 
cause a nonsampling error. 


In reality, a sample will never be exactly representative of the population so 
there will always be some sampling error. As a rule, the larger the sample, the 
smaller the sampling error. 
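
A small simulation can illustrate this rule. The population below is invented (10,000 values spread evenly between 0 and 300), and the sample sizes and number of repetitions are arbitrary choices for the sketch.

```python
import random

rng = random.Random(1020)  # arbitrary seed for repeatability

# Hypothetical population of 10,000 values (for example, dollars spent on books).
population = [rng.uniform(0, 300) for _ in range(10_000)]
population_mean = sum(population) / len(population)


def average_error(n, repetitions=200):
    """Average distance between a sample mean and the population mean."""
    errors = []
    for _ in range(repetitions):
        sample = rng.sample(population, n)
        errors.append(abs(sum(sample) / n - population_mean))
    return sum(errors) / repetitions


small_error = average_error(10)     # samples of 10
large_error = average_error(1000)   # samples of 1,000
print(round(small_error, 2), round(large_error, 2))
```

The error for samples of 1,000 should come out much smaller than the error for samples of 10, although neither is exactly zero.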


In statistics, a sampling bias is created when a sample is collected from a 
population and some members of the population are not as likely to be chosen 
as others (remember, each member of the population should have an equally 
likely chance of being chosen). When a sampling bias happens, there can be 
incorrect conclusions drawn about the population that is being studied. 


Critical Evaluation 


We need to evaluate the statistical studies we read about critically and analyze 
them before accepting the results of the studies. Common problems to be 
aware of include 


• Problems with samples: A sample must be representative of the 
population. A sample that is not representative of the population is 
biased. Biased samples give results that are inaccurate and not valid. 

• Self-selected samples: Responses only by people who choose to respond, 
such as call-in surveys, are often unreliable. 

• Sample size issues: Samples that are too small may be unreliable. Larger 
samples are better, if possible. In some situations, small samples are 
unavoidable and can still be used to draw conclusions. Examples: crash 
testing cars or medical testing for rare conditions. 

• Undue influence: Collecting data or asking questions in a way that 
influences the response. 

• Non-response or refusal of subject to participate: The collected responses 
may no longer be representative of the population. Often, people with 
strong positive or negative opinions may answer surveys, which can 
affect the results. 

• Causality: A relationship between two variables does not mean that one 
causes the other to occur. They may be related (correlated) because of 
their relationship through a different variable. 

• Self-funded or self-interest studies: A study performed by a person or 
organization in order to support their claim. Is the study impartial? Read 
the study carefully to evaluate the work. Do not automatically assume 
that the study is good, but do not automatically assume the study is bad 
either. Evaluate it on its merits and the work done. 

• Misleading use of data: Improperly displayed graphs, incomplete data, or 
lack of context. 

• Confounding: When the effects of multiple factors on a response cannot 
be separated. Confounding makes it difficult or impossible to draw valid 
conclusions about the effect of each factor. 


Example: 
Exercise: 


Problem: 


A study is done to determine the average tuition that San Jose State 
undergraduate students pay per semester. Each student in the following 
samples is asked how much tuition he or she paid for the Fall semester. 
What is the type of sampling in each case? 


a. A sample of 100 undergraduate San Jose State students is taken by 
organizing the students’ names by classification (freshman, 
sophomore, junior, or senior), and then selecting 25 students from 
each. 


b. A random number generator is used to select a student from the 
alphabetical listing of all undergraduate students in the Fall 
semester. Starting with that student, every 50th student is chosen 
until 75 students are included in the sample. 

c. A completely random method is used to select 75 students. Each 
undergraduate student in the fall semester has the same probability 
of being chosen at any stage of the sampling process. 

d. The freshman, sophomore, junior, and senior years are numbered 
one, two, three, and four, respectively. A random number generator 
is used to pick two of those years. All students in those two years 
are in the sample. 

e. An administrative assistant is asked to stand in front of the library 
one Wednesday and to ask the first 100 undergraduate students he 
encounters what they paid for tuition the Fall semester. Those 100 
students are the sample. 


Solution: 


a. Stratified; b. systematic; c. simple random; d. cluster; e. convenience 


Example: 
Exercise: 


Problem: 


Determine the type of sampling used (simple random, stratified, 
systematic, cluster, or convenience). 


a. A soccer coach selects six players from a group of boys aged eight 
to ten, seven players from a group of boys aged 11 to 12, and three 
players from a group of boys aged 13 to 14 to form a recreational 
soccer team. 

b. A pollster interviews all human resource personnel in five different 
high tech companies. 

c. A high school educational researcher interviews 50 high school 
female teachers and 50 high school male teachers. 


d. A medical researcher interviews every third cancer patient from a 
list of cancer patients at a local hospital. 

e. A high school counselor uses a computer to generate 50 random 
numbers and then picks students whose names correspond to the 
numbers. 

f. A student interviews classmates in his algebra class to determine 
how many pairs of jeans a student owns, on the average. 


Solution: 


a. stratified; b. cluster; c. stratified; d. systematic; e. simple random; 
f. convenience 


If we were to examine two samples representing the same population, even if 
we used random sampling methods for the samples, they would not be exactly 
the same. Just as there is variation in data, there is variation in samples. As 
you become accustomed to sampling, the variability will begin to seem 
natural. 


Example: 

Suppose ABC College has 10,000 part-time students (the population). We are 
interested in the average amount of money a part-time student spends on 
books in the fall term. Asking all 10,000 students is an almost impossible 
task. 

Suppose we take two different samples. 

First, we use convenience sampling and survey ten students from a first term 
organic chemistry class. Many of these students are taking first term calculus 
in addition to the organic chemistry class. The amount of money they spend 
on books is as follows: 

$128 $87 $173 $116 $130 $204 $147 $189 $93 $153 

The second sample is taken using a list of senior citizens who take P.E. 
classes and taking every fifth senior citizen on the list, for a total of ten senior 
citizens. They spend: 

$50 $40 $36 $15 $50 $100 $40 $53 $22 $22 


It is unlikely that any student is in both samples. 
Exercise: 


Problem: 


a. Do you think that either of these samples is representative of (or is 
characteristic of) the entire 10,000 part-time student population? 


Solution: 


a. No. The first sample probably consists of science-oriented students. 
Besides the chemistry course, some of them are also taking first-term 
calculus. Books for these classes tend to be expensive. Most of these 
students are, more than likely, paying more than the average part-time 
student for their books. The second sample is a group of senior citizens 
who are, more than likely, taking courses for health and interest. The 
amount of money they spend on books is probably much less than the 
average part-time student. Both samples are biased. Also, in both cases, 
not all students have a chance to be in either sample. 


Exercise: 


Problem: 


b. Since these samples are not representative of the entire population, is 
it wise to use the results to describe the entire population? 


Solution: 


b. No. For these samples, each member of the population did not have an 
equally likely chance of being chosen. 


Now, suppose we take a third sample. We choose ten different part-time 
students from the disciplines of chemistry, math, English, psychology, 
sociology, history, nursing, physical education, art, and early childhood 
development. (We assume that these are the only disciplines in which 
part-time students at ABC College are enrolled and that an equal number of 
part-time students are enrolled in each of the disciplines.) Each student is 
chosen using simple random sampling. Using a calculator, random numbers are 
generated and a student from a particular discipline is selected if he or she 
has a corresponding number. The students spend the following amounts: 


$180 $50 $150 $85 $260 $75 $180 $200 $200 $150 
Exercise: 


Problem: c. Is the sample biased? 


Solution: 


c. The sample is unbiased, but a larger sample would be recommended 
to increase the likelihood that the sample will be close to representative 
of the population. However, for a biased sampling technique, even a 
large sample runs the risk of not being representative of the population. 
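
To see how far apart the three samples are, their averages can be computed directly. The sketch below uses the dollar amounts listed in this example; the short labels are added only for readability.

```python
# Dollar amounts spent on books, copied from the three samples above.
samples = {
    "organic chemistry class": [128, 87, 173, 116, 130, 204, 147, 189, 93, 153],
    "senior citizens in P.E.": [50, 40, 36, 15, 50, 100, 40, 53, 22, 22],
    "one student per discipline": [180, 50, 150, 85, 260, 75, 180, 200, 200, 150],
}

# Average amount spent in each sample.
means = {label: sum(values) / len(values) for label, values in samples.items()}

for label, mean in means.items():
    print(f"{label}: ${mean:.2f}")
```

The averages are $142.00, $42.80, and $153.00. The gap between the first two samples reflects who was asked, which is why neither should be used to describe all 10,000 part-time students.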


Students often ask if it is "good enough" to take a sample, instead of 
surveying the entire population. If the survey is done well, the answer is yes. 


Note: 
Try It 
Exercise: 


Problem: 


A local radio station has a fan base of 20,000 listeners. The station wants 
to know if its audience would prefer more music or more talk shows. 
Asking all 20,000 listeners is an almost impossible task. 


The station uses convenience sampling and surveys the first 200 people 
they meet at one of the station’s music concert events. 24 people said 
they’d prefer more talk shows, and 176 people said they’d prefer more 
music. 


Do you think that this sample is representative of (or is characteristic of) 
the entire 20,000 listener population? 


Solution: 
Try It Solutions 


The sample probably consists more of people who prefer music because 
it is a concert event. Also, the sample represents only those who showed 
up to the event earlier than the majority. The sample probably doesn’t 
represent the entire fan base and is probably biased towards people who 
would prefer music. 


Variation in Data 


Variation is present in any set of data. For example, 16-ounce cans of 
beverage may contain more or less than 16 ounces of liquid. In one study, 
eight 16-ounce cans were measured and produced the following amounts (in 
ounces) of beverage: 


15.8 16.1 15.2 14.8 15.8 15.9 16.0 15.5 


Measurements of the amount of beverage in a 16-ounce can may vary because 
different people make the measurements or because the exact amount, 16 
ounces of liquid, was not put into the cans. Manufacturers regularly run tests 
to determine if the amount of beverage in a 16-ounce can falls within the 
desired range. 
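
The variation in the eight measurements can be summarized with a few simple calculations. This sketch only describes the listed data; it does not use any manufacturer's actual tolerance.

```python
# Amounts of beverage (in ounces) measured in the eight cans above.
amounts = [15.8, 16.1, 15.2, 14.8, 15.8, 15.9, 16.0, 15.5]

smallest = min(amounts)
largest = max(amounts)
spread = round(largest - smallest, 1)             # largest minus smallest
average = round(sum(amounts) / len(amounts), 1)   # average, to one decimal
under_label = sum(1 for a in amounts if a < 16.0)  # cans holding under 16 oz

print(smallest, largest, spread, average, under_label)
```

The amounts run from 14.8 to 16.1 ounces, a spread of 1.3 ounces, and six of the eight cans hold less than the labeled 16 ounces.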


Be aware that as you take data, your data may vary somewhat from the data 
someone else is taking for the same purpose. This is completely natural. 
However, if two or more of you are taking the same data and get very different 
results, it is time for you and the others to reevaluate your data-taking methods 
and your accuracy. 


Variation in Samples 


It was mentioned previously that two or more samples from the same 
population, taken randomly, and having close to the same characteristics of 
the population will likely be different from each other. Suppose Doreen and 
Jung both decide to study the average amount of time students at their college 
sleep each night. Doreen and Jung each take samples of 500 students. Doreen 
uses systematic sampling and Jung uses cluster sampling. Doreen's sample 
will be different from Jung's sample. Even if Doreen and Jung used the same 
sampling method, in all likelihood their samples would be different. Neither 
would be wrong, however. 


Think about what contributes to making Doreen’s and Jung’s samples 
different. 


If Doreen and Jung took larger samples (i.e. the number of data values is 
increased), their sample results (the average amount of time a student sleeps) 
might be closer to the actual population average. But still, their samples would 
be, in all likelihood, different from each other. This variability in samples 
cannot be stressed enough. 
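
The following sketch imitates two students sampling the same population. Everything here is hypothetical: the population of nightly sleep times is simulated, and both samples are simple random samples of 500, which corresponds to the case where Doreen and Jung use the same method.

```python
import random

# Hypothetical population: nightly sleep (in hours) for 10,000 students.
population_rng = random.Random(0)
population = [population_rng.gauss(7, 1.2) for _ in range(10_000)]


def sample_mean(seed, n):
    """Mean of one simple random sample of size n, drawn with its own seed."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)
    return sum(sample) / n


doreen_mean = sample_mean(seed=1, n=500)
jung_mean = sample_mean(seed=2, n=500)
print(round(doreen_mean, 3), round(jung_mean, 3))
```

The two sample averages should be close to each other and to the population average, but not identical, and neither result is wrong.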


Size of a Sample 


The size of a sample (often called the number of observations, usually given 
the symbol n) is important. The examples you have seen in this book so far 
have been small. Samples of only a few hundred observations, or even 
smaller, are sufficient for many purposes. In polling, samples that are from 
1,200 to 1,500 observations are considered large enough and good enough if 
the survey is random and is well done. Later we will find that even much 
smaller sample sizes will give very good results. You will learn why when you 
study confidence intervals. 


Be aware that many large samples are biased. For example, call-in surveys are 
invariably biased, because people choose to respond or not. 
Chapter Review 


Data are individual items of information that come from a population or 
sample. Data may be classified as qualitative (categorical), quantitative 
continuous, or quantitative discrete. 


Because it is not practical to measure the entire population in a study, 
researchers use samples to represent the population. A random sample is a 
representative group from the population chosen by using a method that gives 
each individual in the population an equal chance of being included in the 
sample. Random sampling methods include simple random sampling, 
stratified sampling, cluster sampling, and systematic sampling. Convenience 
sampling is a nonrandom method of choosing a sample that often produces 
biased data. 


Samples that contain different individuals result in different data. This is true 
even when the samples are well-chosen and representative of the population. 
When properly selected, larger samples model the population more closely 
than smaller samples. There are many different potential problems that can 
affect the reliability of a sample. Statistical data needs to be critically 
analyzed, not simply accepted. 


HOMEWORK 
For the following exercises, identify the type of data that would be used to 
describe a response (quantitative discrete, quantitative continuous, or 
qualitative), and give an example of the data. 
Exercise: 


Problem: number of tickets sold to a concert 
Solution: 


quantitative discrete, 150 


Exercise: 


Problem: percent of body fat 


Exercise: 


Problem: favorite baseball team 
Solution: 


qualitative, Oakland A’s 


Exercise: 


Problem: time in line to buy groceries 


Exercise: 


Problem: number of students enrolled at Evergreen Valley College 


Solution: 
quantitative discrete, 11,234 students 


Exercise: 


Problem: most-watched television show 


Exercise: 


Problem: brand of toothpaste 


Solution: 


qualitative, Crest 


Exercise: 


Problem: distance to the closest movie theatre 


Exercise: 


Problem: age of executives in Fortune 500 companies 


Solution: 


quantitative continuous, 47.3 years 


Exercise: 
Problem: number of competing computer spreadsheet software packages 


Use the following information to answer the next two exercises: A study was 
done to determine the age, number of times per week, and the duration 
(amount of time) of resident use of a local park in San Jose. The first house in 
the neighborhood around the park was selected randomly and then every 8th 
house in the neighborhood around the park was interviewed. 

Exercise: 


Problem: “Number of times per week” is what type of data? 


a. qualitative (categorical) 
b. quantitative discrete 
c. quantitative continuous 


Solution: 


b 


Exercise: 


Problem: “Duration (amount of time)” is what type of data? 


a. qualitative (categorical) 
b. quantitative discrete 
c. quantitative continuous 


Exercise: 


Problem: 


Airline companies are interested in the consistency of the number of 
babies on each flight, so that they have adequate safety equipment. 
Suppose an airline conducts a survey. Over Thanksgiving weekend, it 
surveys six flights from Boston to Salt Lake City to determine the 
number of babies on the flights. It determines the amount of safety 
equipment needed by the result of that study. 


a. Using complete sentences, list three things wrong with the way the 
survey was conducted. 

b. Using complete sentences, list three ways that you would improve 
the survey if it were to be repeated. 


Solution: 


a. The survey was conducted using six similar flights. 
The survey would not be a true representation of the entire 
population of air travelers. 
Conducting the survey on a holiday weekend will not produce 
representative results. 

b. Conduct the survey during different times of the year. 
Conduct the survey using flights to and from various locations. 
Conduct the survey on different days of the week. 


Exercise: 
Problem: 
Suppose you want to determine the mean number of students per 
statistics class in your state. Describe a possible sampling method in three 
to five complete sentences. Make the description detailed. 


Exercise: 
Problem: 
Suppose you want to determine the mean number of cans of soda drunk 
each month by students in their twenties at your school. Describe a 
possible sampling method in three to five complete sentences. Make the 
description detailed. 


Solution: 


Answers will vary. Sample Answer: You could use a systematic sampling 
method. Stop the tenth person as they leave one of the buildings on 
campus at 9:50 in the morning. Then stop the tenth person as they leave a 
different building on campus at 1:50 in the afternoon. 


Exercise: 
Problem: 
List some practical difficulties involved in getting accurate results from a 
telephone survey. 


Exercise: 


Problem: 


List some practical difficulties involved in getting accurate results from a 
mailed survey. 


Solution: 


Answers will vary. Sample Answer: Many people will not respond to 
mail surveys. If they do respond to the surveys, you can’t be sure who is 
responding. In addition, mailing lists can be incomplete. 


Exercise: 
Problem: 
With your classmates, brainstorm some ways you could overcome these 
problems if you needed to conduct a phone or mail survey. 
Exercise: 
Problem: 
The instructor takes her sample by gathering data on five randomly 
selected students from each Lake Tahoe Community College math class. 
The type of sampling she used is 


a. cluster sampling 
b. stratified sampling 
c. simple random sampling 
d. convenience sampling 


Solution: 


b 


Exercise: 


Problem: 


A study was done to determine the age, number of times per week, and 
the duration (amount of time) of residents using a local park in San Jose. 
The first house in the neighborhood around the park was selected 
randomly and then every eighth house in the neighborhood around the 
park was interviewed. The sampling method was: 


a. simple random 
b. systematic 

c. stratified 

d. cluster 


Exercise: 


Problem: 
Name the sampling method used in each of the following situations: 


a. A woman in the airport is handing out questionnaires to travelers 
asking them to evaluate the airport’s service. She does not ask 
travelers who are hurrying through the airport with their hands full 
of luggage, but instead asks all travelers who are sitting near gates 
and not taking naps while they wait. 

b. A teacher wants to know if her students are doing homework, so she 
randomly selects rows two and five and then calls on all students in 
row two and all students in row five to present the solutions to 
homework problems to the class. 

c. The marketing manager for an electronics chain store wants 
information about the ages of its customers. Over the next two 
weeks, at each store location, 100 randomly selected customers are 
given questionnaires to fill out asking for information about age, as 
well as about other variables of interest. 

d. The librarian at a public library wants to determine what proportion 
of the library users are children. The librarian has a tally sheet on 
which she marks whether books are checked out by an adult or a 
child. She records this data for every fourth patron who checks out 
books. 


e. A political party wants to know the reaction of voters to a debate
between the candidates. The day after the debate, the party’s polling
staff calls 1,200 randomly selected phone numbers. If a registered
voter answers the phone or is available to come to the phone, that
registered voter is asked whom he or she intends to vote for and
whether the debate changed his or her opinion of the candidates.


Solution: 


a. convenience
b. cluster
c. stratified
d. systematic
e. simple random


Exercise: 


Problem: 


A “random survey” was conducted of 3,274 people of the 
“microprocessor generation” (people born since 1971, the year the 
microprocessor was invented). It was reported that 48% of those 
individuals surveyed stated that if they had $2,000 to spend, they would 
use it for computer equipment. Also, 66% of those surveyed considered 
themselves relatively savvy computer users. 


a. Do you consider the sample size large enough for a study of this
type? Why or why not?

b. Based on your “gut feeling,” do you believe the percents accurately
reflect the U.S. population for those individuals born since 1971? If
not, do you think the percents of the population are actually higher
or lower than the sample statistics? Why?

Additional information: The survey, reported by Intel Corporation,
was filled out by individuals who visited the Los Angeles
Convention Center to see the Smithsonian Institute's road show
called “America’s Smithsonian.”

c. With this additional information, do you feel that all demographic
and ethnic groups were equally represented at the event? Why or
why not?

d. With the additional information, comment on how accurately you
think the sample statistics reflect the population parameters.


Exercise: 


Problem: 


The Well-Being Index is a survey that follows trends of U.S. residents on 
a regular basis. There are six areas of health and wellness covered in the 
survey: Life Evaluation, Emotional Health, Physical Health, Healthy 
Behavior, Work Environment, and Basic Access. Some of the questions 
used to measure the Index are listed below. 


Identify the type of data obtained from each question used in this survey:
qualitative (categorical), quantitative discrete, or quantitative continuous.


a. Do you have any health problems that prevent you from doing any 
of the things people your age can normally do? 

b. During the past 30 days, for about how many days did poor health 
keep you from doing your usual activities? 

c. In the last seven days, on how many days did you exercise for 30 
minutes or more? 

d. Do you have health insurance coverage? 


Solution: 


a. qualitative (categorical)
b. quantitative discrete
c. quantitative discrete
d. qualitative (categorical)


Exercise: 


Problem: 


In advance of the 1936 Presidential Election, a magazine titled Literary 
Digest released the results of an opinion poll predicting that the 
Republican candidate Alf Landon would win by a large margin. The
magazine sent post cards to approximately 10,000,000 prospective voters. 
These prospective voters were selected from the subscription list of the 
magazine, from automobile registration lists, from phone lists, and from 
club membership lists. Approximately 2,300,000 people returned the 
postcards. 


a. Think about the state of the United States in 1936. Explain why a 
sample chosen from magazine subscription lists, automobile 
registration lists, phone books, and club membership lists was not 
representative of the population of the United States at that time. 

b. What effect does the low response rate have on the reliability of the 
sample? 

c. Are these problems examples of sampling error or nonsampling 
error? 

d. During the same year, George Gallup conducted his own poll of 
30,000 prospective voters. These researchers used a method they 
called "quota sampling" to obtain survey answers from specific 
subsets of the population. Quota sampling is an example of which 
sampling method described in this module? 


Exercise: 


Problem: 


Crime-related and demographic statistics for 47 US states in 1960 were 
collected from government agencies, including the FBI's Uniform Crime 
Report. One analysis of this data found a strong connection between 
education and crime indicating that higher levels of education in a 
community correspond to higher crime rates. 


Which of the potential problems with samples discussed in [link] could 
explain this connection? 


Solution: 


Causality: The fact that two variables are related does not guarantee that 
one variable is influencing the other. We cannot assume that crime rate 
impacts education level or that education level impacts crime rate. 


Confounding: There are many factors that define a community other than
education level and crime rate. Communities with high crime rates and
high education levels may have other lurking variables that distinguish
them from communities with lower crime rates and lower education
levels. Because we cannot isolate these variables of interest, we cannot
draw valid conclusions about the connection between education and
crime. Possible lurking variables include police expenditures,
unemployment levels, region, average age, and size.


Exercise: 


Problem: 


YouPolls is a website that allows anyone to create and respond to polls. 
One question posted April 15 asks: 


“Do you feel happy paying your taxes when members of the Obama 
administration are allowed to ignore their tax liabilities?” (lastbaldeagle. 
2013. On Tax Day, House to Call for Firing Federal Workers Who Owe 
Back Taxes. Opinion poll posted online at: 

http://www.youpolls.com/details.aspx?id=12328 (accessed May 1,
2013).) 


As of April 25, 11 people responded to this question. Each participant 
answered “NO!” 


Which of the potential problems with samples discussed in this module 
could explain this connection? 


Exercise: 


Problem: 
A scholarly article about response rates begins with the following quote: 


“Declining contact and cooperation rates in random digit dial (RDD) 
national telephone surveys raise serious concerns about the validity of 
estimates drawn from such research.” (Scott Keeter et al., “Gauging the 
Impact of Growing Nonresponse on Estimates from a National RDD 
Telephone Survey,” Public Opinion Quarterly 70 no. 5 (2006), 


2013).) 
The Pew Research Center for People and the Press admits: 
“The percentage of people we interview — out of all we try to interview — 


has been declining over the past decade or more.” (Frequently Asked
Questions, Pew Research Center for the People & the Press,
http://www.people-press.org/methodology/frequently-asked-questions/#dont-you-have-trouble-getting-people-to-answer-your-polls
(accessed May 1, 2013).)


a. What are some reasons for the decline in response rate over the past 
decade? 

b. Explain why researchers are concerned with the impact of the 
declining response rate on public opinion polls. 


Solution: 


a. Possible reasons: increased use of caller ID, decreased use of
landlines, increased use of private numbers, voice mail, privacy
managers, hectic nature of personal schedules, decreased willingness
to be interviewed

b. When a large number of people refuse to participate, the sample
may not have the same characteristics as the population. Perhaps the
majority of people willing to participate are doing so because they
feel strongly about the subject of the survey.


Glossary 


Cluster Sampling 
a method for selecting a random sample; divide the population into
groups (clusters), then use simple random sampling to select a set of
clusters. Every individual in the chosen clusters is included in the sample.


Continuous Random Variable 
a random variable (RV) whose outcomes are measured; the height of 
trees in the forest is a continuous RV. 


Convenience Sampling 
a nonrandom method of selecting a sample; this method selects
individuals that are easily accessible and may result in biased data.


Discrete Random Variable 


a random variable (RV) whose outcomes are counted 


Nonsampling Error 
an issue that affects the reliability of sampling data other than natural 
variation; it includes a variety of human errors including poor study 
design, biased sampling methods, inaccurate information provided by 
study participants, data entry errors, and poor analysis. 


Qualitative Data 
See Data. 


Quantitative Data 
See Data. 


Random Sampling 
a method of selecting a sample that gives every member of the population 
an equal chance of being selected. 


Sampling Bias 
not all members of the population are equally likely to be selected 


Sampling Error 
the natural variation that results from selecting a sample to represent a 
larger population; this variation decreases as the sample size increases, so 
selecting larger samples reduces sampling error. 


Sampling with Replacement 
Once a member of the population is selected for inclusion in a sample, 
that member is returned to the population for the selection of the next 
individual. 


Sampling without Replacement 
A member of the population may be chosen for inclusion in a sample 
only once. If chosen, the member is not returned to the population before 
the next selection. 


Simple Random Sampling 
a straightforward method for selecting a random sample; give each
member of the population a number. Use a random number generator to
select a set of labels. These randomly selected labels identify the
members of your sample.


Stratified Sampling 
a method for selecting a random sample used to ensure that subgroups of 
the population are represented adequately; divide the population into 
groups (strata). Use simple random sampling to identify a proportionate 
number of individuals from each stratum. 


Systematic Sampling 
a method for selecting a random sample; list the members of the 
population. Use simple random sampling to select a starting point in the 
population. Let k = (number of individuals in the population)/(number of 
individuals needed in the sample). Choose every kth individual in the list 
starting with the one that was randomly selected. If necessary, return to 
the beginning of the population list to complete your sample. 
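
The systematic sampling steps in the glossary entry above can be sketched in Python. This is only an illustration under stated assumptions: the function name `systematic_sample`, the roster of 20 numbered members, and the fixed random seed are all made up for the example and do not come from the text.

```python
import random

def systematic_sample(population, sample_size, rng):
    """Pick a random starting point, then take every kth member, wrapping around."""
    n = len(population)
    k = n // sample_size          # k = population size / sample size
    start = rng.randrange(n)      # randomly selected starting position
    # Return to the beginning of the list when the end is reached.
    return [population[(start + i * k) % n] for i in range(sample_size)]

roster = list(range(1, 21))       # hypothetical population: members numbered 1 to 20
sample = systematic_sample(roster, 5, random.Random(1))
print(sample)
```

With 20 members and a sample of 5, k is 4, so the chosen labels are always four positions apart on the list no matter where the random start falls.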


Levels of Measurement 


Once you have a set of data, you will need to organize it so that you can analyze how frequently 
each datum occurs in the set. However, when calculating the frequency, you may need to round 
your answers so that they are as precise as possible. 


Levels of Measurement 


The way a set of data is measured is called its level of measurement. Correct statistical procedures 
depend on a researcher being familiar with levels of measurement. Not every statistical operation 
can be used with every set of data. Data can be classified into four levels of measurement. They are 
(from lowest to highest level): 


Nominal scale level 
Ordinal scale level 
Interval scale level 
Ratio scale level 


Data that is measured using a nominal scale is qualitative (categorical). Categories, colors, 
names, labels and favorite foods along with yes or no responses are examples of nominal level 
data. Nominal scale data are not ordered. For example, trying to classify people according to their 
favorite food does not make any sense. Putting pizza first and sushi second is not meaningful. 


Smartphone companies are another example of nominal scale data. The data are the names of the 
companies that make smartphones, but there is no agreed upon order of these brands, even though 
people may have personal preferences. Nominal scale data cannot be used in calculations. 


Data that is measured using an ordinal scale is similar to nominal scale data but there is a big 
difference. The ordinal scale data can be ordered. An example of ordinal scale data is a list of the 
top five national parks in the United States. The top five national parks in the United States can be 
ranked from one to five but we cannot measure differences between the data. 


Another example of using the ordinal scale is a cruise survey where the responses to questions 
about the cruise are “excellent,” “good,” “satisfactory,” and “unsatisfactory.” These responses are 
ordered from the most desired response to the least desired. But the differences between two pieces 
of data cannot be measured. Like the nominal scale data, ordinal scale data cannot be used in 
calculations. 


Data that is measured using the interval scale is similar to ordinal level data because it has a
definite ordering, but the differences between interval scale data can also be measured. Interval
scale data does not have a natural starting point (a true zero).


Temperature scales like Celsius (C) and Fahrenheit (F) are measured by using the interval scale. In 
both temperature measurements, 40° is equal to 100° minus 60°. Differences make sense. But 0 
degrees does not because, in both scales, 0 is not the absolute lowest temperature. Temperatures 
like -10° F and -15° C exist and are colder than 0. 


Interval level data can be used in calculations, but one type of comparison cannot be done. 80° C is
not four times as hot as 20° C (nor is 80° F four times as hot as 20° F). There is no meaning to the
ratio of 80 to 20 (or four to one).
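
A short calculation suggests why the ratio carries no meaning on an interval scale. The sketch below expresses the same two temperatures in Celsius, in Fahrenheit, and in kelvins. The Kelvin scale is not discussed in this text; it is brought in here only as an assumption for comparison, because its zero is the absolute lowest temperature. The helper names are hypothetical.

```python
def c_to_f(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

def c_to_k(c):
    """Convert degrees Celsius to kelvins."""
    return c + 273.15

hot_c, cold_c = 80, 20

ratio_c = hot_c / cold_c                     # ratio on the Celsius scale
ratio_f = c_to_f(hot_c) / c_to_f(cold_c)     # same temperatures, Fahrenheit scale
ratio_k = c_to_k(hot_c) / c_to_k(cold_c)     # same temperatures, Kelvin scale

diff_c = hot_c - cold_c                      # difference in Celsius degrees
diff_f = c_to_f(hot_c) - c_to_f(cold_c)      # difference in Fahrenheit degrees

print(ratio_c, round(ratio_f, 2), round(ratio_k, 2))
print(diff_c, diff_f)
```

The ratio changes with the choice of scale, while the difference is the same amount of heat either way (60 Celsius degrees equal 108 Fahrenheit degrees). That is the sense in which differences make sense on an interval scale but ratios do not.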


Data that is measured using the ratio scale takes care of the ratio problem and gives you the most 
information. Ratio scale data is like interval scale data, but it has a 0 point and ratios can be 
calculated. For example, four multiple choice statistics final exam scores are 80, 68, 20 and 92 (out 
of a possible 100 points). The exams are machine-graded. 


The data can be put in order from lowest to highest: 20, 68, 80, 92.

The differences between the data have meaning. The score 92 is more than the score 68 by 24
points.

Ratios can be calculated. The smallest possible score is 0. So 80 is four times 20. The score of 80
is four times better than the score of 20.


Frequency 


Twenty students were asked how many hours they worked per day. Their responses, in hours, are
as follows: 5; 6; 3; 3; 2; 4; 7; 5; 2; 3; 5; 6; 5; 4; 4; 3; 5; 2; 5; 3.


[link] lists the different data values in ascending order and their frequencies. 


Data value   Frequency
2            3
3            5
4            3
5            6
6            2
7            1


Frequency Table of Student Work Hours 


A frequency is the number of times a value of the data occurs. According to [link], there are three 
students who work two hours, five students who work three hours, and so on. The sum of the 
values in the frequency column, 20, represents the total number of students included in the sample. 
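
As a minimal sketch, the same tally can be produced with Python's standard `collections.Counter`. The list below is the twenty responses read as single-digit values; the variable names are chosen only for this illustration.

```python
from collections import Counter

# Hours worked per day reported by the 20 students
hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

frequency = Counter(hours)            # maps each data value to its count
for value in sorted(frequency):
    print(value, frequency[value])

total = sum(frequency.values())       # number of students in the sample
print("Total:", total)
```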


A relative frequency is the ratio (fraction or proportion) of the number of times a value of the data 
occurs in the set of all outcomes to the total number of outcomes. To find the relative frequencies, 
divide each frequency by the total number of students in the sample—in this case, 20. Relative 
frequencies can be written as fractions, percents, or decimals. 
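
The division described above can be sketched as follows. The dictionary simply restates the frequencies of the student work hours data; the names are illustrative.

```python
frequency = {2: 3, 3: 5, 4: 3, 5: 6, 6: 2, 7: 1}
total = sum(frequency.values())       # 20 students in the sample

# Relative frequency = frequency / total number of data values
relative = {value: count / total for value, count in frequency.items()}

for value, count in frequency.items():
    # Show each relative frequency as a fraction, a decimal, and a percent.
    print(f"{value}: {count}/{total} = {relative[value]:.2f} = {relative[value]:.0%}")
```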


Data value   Frequency   Relative frequency
2            3           3/20 or 0.15
3            5           5/20 or 0.25
4            3           3/20 or 0.15
5            6           6/20 or 0.30
6            2           2/20 or 0.10
7            1           1/20 or 0.05


Frequency Table of Student Work Hours with Relative Frequencies 


The sum of the values in the relative frequency column of [link] is 20/20, or 1.

Cumulative relative frequency is the accumulation of the previous relative frequencies. To find
the cumulative relative frequencies, add all the previous relative frequencies to the relative
frequency for the current row, as shown in [link].


Data value   Frequency   Relative frequency   Cumulative relative frequency
2            3           3/20 or 0.15         0.15
3            5           5/20 or 0.25         0.15 + 0.25 = 0.40
4            3           3/20 or 0.15         0.40 + 0.15 = 0.55
5            6           6/20 or 0.30         0.55 + 0.30 = 0.85
6            2           2/20 or 0.10         0.85 + 0.10 = 0.95
7            1           1/20 or 0.05         0.95 + 0.05 = 1.00


Frequency Table of Student Work Hours with Relative and Cumulative Relative Frequencies 


The last entry of the cumulative relative frequency column is one, indicating that one hundred 
percent of the data has been accumulated. 
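
One way to sketch the running total in Python is with `itertools.accumulate`. The example below accumulates the counts first and divides afterward; that ordering is a design choice made for this illustration so that rounding happens only once per row.

```python
from itertools import accumulate

values = [2, 3, 4, 5, 6, 7]           # data values (hours worked)
frequency = [3, 5, 3, 6, 2, 1]        # frequency of each data value
total = sum(frequency)

# Running total of the counts, then each running total divided by 20
cumulative_counts = list(accumulate(frequency))
cumulative_relative = [round(c / total, 2) for c in cumulative_counts]

for value, crf in zip(values, cumulative_relative):
    print(value, crf)
```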


Note:
Because of rounding, the relative frequency column may not always sum to one, and the last entry
in the cumulative relative frequency column may not be one. However, they each should be close
to one.


[link] represents the heights, in inches, of a sample of 100 male semiprofessional soccer players. 


Heights (inches)   Frequency     Relative frequency   Cumulative relative frequency
59.95-61.95        5             5/100 = 0.05         0.05
61.95-63.95        3             3/100 = 0.03         0.05 + 0.03 = 0.08
63.95-65.95        15            15/100 = 0.15        0.08 + 0.15 = 0.23
65.95-67.95        40            40/100 = 0.40        0.23 + 0.40 = 0.63
67.95-69.95        17            17/100 = 0.17        0.63 + 0.17 = 0.80
69.95-71.95        12            12/100 = 0.12        0.80 + 0.12 = 0.92
71.95-73.95        7             7/100 = 0.07         0.92 + 0.07 = 0.99
73.95-75.95        1             1/100 = 0.01         0.99 + 0.01 = 1.00
                   Total = 100   Total = 1.00


Frequency Table of Soccer Player Height


The data in this table have been grouped into the following intervals: 


59.95 to 61.95 inches
61.95 to 63.95 inches
63.95 to 65.95 inches
65.95 to 67.95 inches
67.95 to 69.95 inches
69.95 to 71.95 inches
71.95 to 73.95 inches
73.95 to 75.95 inches


In this sample, there are five players whose heights fall within the interval 59.95-61.95 inches,
three players whose heights fall within the interval 61.95-63.95 inches, 15 players whose heights
fall within the interval 63.95-65.95 inches, 40 players whose heights fall within the interval
65.95-67.95 inches, 17 players whose heights fall within the interval 67.95-69.95 inches, 12 players
whose heights fall within the interval 69.95-71.95 inches, seven players whose heights fall within
the interval 71.95-73.95 inches, and one player whose height falls within the interval 73.95-75.95
inches. All heights fall between the endpoints of an interval and not at the endpoints.


Example: 
Exercise: 


Problem: From [link], find the percentage of heights that are less than 65.95 inches. 
Solution: 


If you look at the first, second, and third rows, the heights are all less than 65.95 inches. 
There are 5 + 3 + 15 = 23 players whose heights are less than 65.95 inches. The percentage 
of heights less than 65.95 inches is then 23/100 or 23%. This percentage is the cumulative
relative frequency entry in the third row. 


Note: 
Try It 
Exercise: 


Problem: [link] shows the amount, in inches, of annual rainfall in a sample of towns. 


Rainfall (inches)   Frequency    Relative frequency   Cumulative relative frequency
2.95-4.97           6            6/50 = 0.12          0.12
4.97-6.99           7            7/50 = 0.14          0.12 + 0.14 = 0.26
6.99-9.01           15           15/50 = 0.30         0.26 + 0.30 = 0.56
9.01-11.03          8            8/50 = 0.16          0.56 + 0.16 = 0.72
11.03-13.05         9            9/50 = 0.18          0.72 + 0.18 = 0.90
13.05-15.07         5            5/50 = 0.10          0.90 + 0.10 = 1.00
                    Total = 50   Total = 1.00


From [link], find the percentage of rainfall that is less than 9.01 inches. 


Solution: 
Try It Solutions 


0.56 or 56% 


Example: 
Exercise: 


Problem: 
From [link], find the percentage of heights that fall between 61.95 and 65.95 inches. 
Solution: 


Add the relative frequencies in the second and third rows: 0.03 + 0.15 = 0.18 or 18%. 


Note: 
Try It 
Exercise: 


Problem: From [link], find the percentage of rainfall that is between 6.99 and 13.05 inches. 


Solution: 
Try It Solutions 


0.30 + 0.16 + 0.18 = 0.64 or 64% 


Example: 
Exercise: 


Problem: 


Use the heights of the 100 male semiprofessional soccer players in [link]. Fill in the blanks 
and check your answers. 


a. The percentage of heights that are from 67.95 to 71.95 inches is: ____.

b. The percentage of heights that are from 67.95 to 73.95 inches is: ____.

c. The percentage of heights that are more than 65.95 inches is: ____.

d. The number of players in the sample who are between 61.95 and 71.95 inches tall is: ____.

e. What kind of data are the heights? 
f. Describe how you could gather this data (the heights) so that the data are characteristic 
of all male semiprofessional soccer players. 


Remember, you count frequencies. To find the relative frequency, divide the frequency by 
the total number of data values. To find the cumulative relative frequency, add all of the 
previous relative frequencies to the relative frequency for the current row. 


Solution: 


a. 29% 

b. 36% 

c. 77%

d. 87 

e. quantitative continuous 

f. get rosters from each team and choose a simple random sample from each 
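
Parts a through d can be checked with a few lines of Python. The sketch below uses the class boundaries and the counts stated for the soccer player sample; the helper functions `count_between` and `percent_between` are hypothetical names introduced only for this illustration, and they assume both arguments are class boundaries.

```python
# Class boundaries (inches) and frequencies for the 100 soccer players
boundaries = [59.95, 61.95, 63.95, 65.95, 67.95, 69.95, 71.95, 73.95, 75.95]
frequency = [5, 3, 15, 40, 17, 12, 7, 1]
total = sum(frequency)

def count_between(low, high):
    """Count players in the classes lying between two class boundaries."""
    first = boundaries.index(low)
    last = boundaries.index(high)
    return sum(frequency[first:last])

def percent_between(low, high):
    """Percentage of the sample lying between two class boundaries."""
    return 100 * count_between(low, high) / total

print(percent_between(67.95, 71.95))   # part a
print(percent_between(67.95, 73.95))   # part b
print(percent_between(65.95, 75.95))   # part c: more than 65.95 inches
print(count_between(61.95, 71.95))     # part d
```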


Example: 
Nineteen people were asked how many miles, to the nearest mile, they commute to work each day.
The data are as follows: 2; 5; 7; 3; 2; 10; 18; 15; 20; 7; 10; 18; 5; 12; 13; 12; 4; 5; 10. [link] was
produced:


Data   Frequency   Relative frequency   Cumulative relative frequency
3      3           3/19                 0.1579
4      1           1/19                 0.2105
5      3           3/19                 0.1579
7      2           2/19                 0.2632
10     3           4/19                 0.4737
12     2           2/19                 0.7895
13     1           1/19                 0.8421
15     1           1/19                 0.8948
18     1           1/19                 0.9474
20     1           1/19                 1.0000


Frequency of Commuting Distances 


Exercise: 


Problem: 


a. Is the table correct? If it is not correct, what is wrong?

b. True or False: Three percent of the people surveyed commute three miles. If the
statement is not correct, what should it be? If the table is incorrect, make the corrections.

c. What fraction of the people surveyed commute five or seven miles?

d. What fraction of the people surveyed commute 12 miles or more? Less than 12 miles?
Between five and 13 miles (not including five and 13 miles)?


Solution:


a. No. The frequency column sums to 18, not 19. Not all cumulative relative frequencies
are correct.

b. False. The frequency for three miles should be one; for two miles (left out), two. The
cumulative relative frequency column should read: 0.1053, 0.1579, 0.2105, 0.3684,
0.4737, 0.6316, 0.7368, 0.7895, 0.8421, 0.9474, 1.0000.

c. 5/19

d. 7/19, 12/19, 7/19


Note:
Try It
Exercise: 


Problem: 


[link] represents the amount, in inches, of annual rainfall in a sample of towns. What fraction 
of towns surveyed get between 11.03 and 13.05 inches of rainfall each year? 


Solution: 
Try It Solutions 


9/50


Example: 
[link] contains the total number of deaths worldwide as a result of earthquakes for the period from 
2000 to 2012. 


Year    Total number of deaths
2000    231
2001    21,357
2002    11,685
2003    33,819
2004    228,802
2005    88,003
2006    6,605
2007    712
2008    88,011
2009    1,790
2010    320,120
2011    21,953
2012    768
Total   823,856
Exercise: 


Problem: Answer the following questions. 


a. What is the frequency of deaths measured from 2006 through 2009? 

b. What percentage of deaths occurred after 2009? 

c. What is the relative frequency of deaths that occurred in 2003 or earlier? 

d. What is the percentage of deaths that occurred in 2004? 

e. What kind of data are the numbers of deaths? 

f. The Richter scale is used to quantify the energy produced by an earthquake. Examples 
of Richter scale numbers are 2.3, 4.0, 6.1, and 7.0. What kind of data are these numbers? 


Solution: 


a. 97,118 (11.8%) 

b. 41.6% 

c. 67,092/823,856 or 0.081 or 8.1%
d. 27.8% 

e. Quantitative discrete 

f. Quantitative continuous 
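
The numeric answers for parts a through d can be reproduced from the yearly totals. The following is a minimal sketch; the helper names `deaths_in` and `percent` are hypothetical and exist only for this illustration.

```python
# Worldwide earthquake deaths by year, 2000 to 2012
deaths = {
    2000: 231, 2001: 21357, 2002: 11685, 2003: 33819, 2004: 228802,
    2005: 88003, 2006: 6605, 2007: 712, 2008: 88011, 2009: 1790,
    2010: 320120, 2011: 21953, 2012: 768,
}
total = sum(deaths.values())

def deaths_in(first_year, last_year):
    """Total deaths from first_year through last_year, inclusive."""
    return sum(n for year, n in deaths.items() if first_year <= year <= last_year)

def percent(count):
    """Share of all deaths in the period, as a percentage to one decimal place."""
    return round(100 * count / total, 1)

a = deaths_in(2006, 2009)              # part a: frequency, 2006 through 2009
b = percent(deaths_in(2010, 2012))     # part b: after 2009
c = percent(deaths_in(2000, 2003))     # part c: 2003 or earlier
d = percent(deaths_in(2004, 2004))     # part d: 2004 only
print(total, a, percent(a), b, c, d)
```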


Note: 
Try It 
Exercise: 


Problem: 


[link] contains the total number of fatal motor vehicle traffic crashes in the United States for
the period from 1994 to 2011.


Year   Total number of crashes   Year    Total number of crashes
1994   36,254                    2004    38,444
1995   37,241                    2005    39,252
1996   37,494                    2006    38,648
1997   37,324                    2007    37,435
1998   37,107                    2008    34,172
1999   37,140                    2009    30,862
2000   37,526                    2010    30,296
2001   37,862                    2011    29,757
2002   38,491                    Total   653,782
2003   38,477


Answer the following questions. 


a. What is the frequency of fatal crashes measured from 2000 through 2004?

b. What percentage of fatal crashes occurred after 2006?

c. What is the relative frequency of fatal crashes that occurred in 2000 or before?

d. What is the percentage of fatal crashes that occurred in 2011?

e. What is the cumulative relative frequency for 2006? Explain what this number tells you 
about the data. 


Solution: 
Try It Solutions 


a. 190,800 (29.2%) 

b. 24.9% 

c. 260,086/653,782 or 39.8% 

d. 4.6% 

e. 75.1% of all fatal traffic crashes for the period from 1994 to 2011 happened from 1994 
to 2006. 
Chapter Review


Some calculations generate numbers that are artificially precise. It is not necessary to report a value 
to eight decimal places when the measures that generated that value were only accurate to the 
nearest tenth. Round off your final answer to one more decimal place than was present in the 
original data. This means that if you have data measured to the nearest tenth of a unit, report the 
final statistic to the nearest hundredth. 
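
The rounding rule can be sketched in Python as follows. The function name `round_final` and the three measurements are hypothetical, chosen only to illustrate the rule; the value 23.462 is the unrounded mean commute time from the homework solution in this section.

```python
from statistics import mean

def round_final(value, data_decimals):
    """Round a final statistic to one more decimal place than the data carry."""
    return round(value, data_decimals + 1)

# Hypothetical measurements recorded to the nearest tenth (one decimal place)
measurements = [4.2, 3.9, 5.0]
average = mean(measurements)           # unrounded mean, about 4.3667
print(round_final(average, 1))         # reported to the nearest hundredth

# Data recorded to the nearest tenth, so the mean is reported to the hundredth
print(round_final(23.462, 1))
```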


In addition to rounding your answers, you can measure your data using the following four levels of 
measurement. 


Nominal scale level: data that cannot be ordered nor can it be used in calculations

Ordinal scale level: data that can be ordered; the differences cannot be measured

Interval scale level: data with a definite ordering but no starting point; the differences can be
measured, but there is no such thing as a ratio.

Ratio scale level: data with a starting point that can be ordered; the differences have meaning
and ratios can be calculated.


When organizing data, it is important to know how many times a value appears. How many 
statistics students study five hours or more for an exam? What percent of families on our block 
own two pets? Frequency, relative frequency, and cumulative relative frequency are measures that 
answer questions like these. 


HOMEWORK 


Exercise: 


Problem: 


Fifty part-time students were asked how many courses they were taking this term. The 
(incomplete) results are shown below: 


# of courses   Frequency   Relative frequency   Cumulative relative frequency
1              30          0.6
2              15
3


Part-time Student Course Loads 


a. Fill in the blanks in [link]. 
b. What percent of students take exactly two courses? 
c. What percent of students take one or two courses? 


Exercise: 
Problem: 


Sixty adults with gum disease were asked the number of times per week they used to floss 
before their diagnosis. The (incomplete) results are shown in [link]. 


# flossing per week   Frequency   Relative frequency   Cumulative relative frequency
0                     27          0.4500
1                     18
3                                                      0.9333
6                     3           0.0500
7                     1           0.0167


Flossing Frequency for Adults with Gum Disease 


a. Fill in the blanks in [link]. 
b. What percent of adults flossed six times per week? 
c. What percent flossed at most three times per week? 


Solution:

a.
# flossing per week   Frequency   Relative frequency   Cumulative relative frequency
0                     27          0.4500               0.4500
1                     18          0.3000               0.7500
3                     11          0.1833               0.9333
6                     3           0.0500               0.9833
7                     1           0.0167               1

b. 5.00%
c. 93.33%
Exercise: 
Problem: 


Nineteen immigrants to the U.S. were asked how many years, to the nearest year, they have
lived in the U.S. The data are as follows: 2; 5; 7; 2; 2; 10; 20; 15; 0; 7; 0; 20; 5; 12; 15; 12; 4; 5; 10.


[link] was produced.


Data   Frequency   Relative frequency   Cumulative relative frequency
0      2           2/19                 0.1053
2      3           3/19                 0.2632
4      1           1/19                 0.3158
5      3           3/19                 0.4737
7      2           2/19                 0.5789
10     2           2/19                 0.6842
12     2           2/19                 0.7895
15     1           1/19                 0.8421
20     1           1/19                 1.0000


Frequency of Immigrant Survey Responses


a. Fix the errors in [link]. Also, explain how someone might have arrived at the incorrect
number(s).

b. Explain what is wrong with this statement: “47 percent of the people surveyed have lived
in the U.S. for 5 years.”

c. Fix the statement in b to make it correct.

d. What fraction of the people surveyed have lived in the U.S. five or seven years? 

e. What fraction of the people surveyed have lived in the U.S. at most 12 years? 

f. What fraction of the people surveyed have lived in the U.S. fewer than 12 years? 

g. What fraction of the people surveyed have lived in the U.S. from five to 20 years, 
inclusive? 


Exercise: 


Problem: 


How much time does it take to travel to work? [link] shows the mean commute time by state
for workers at least 16 years old who are not working at home. Find the mean travel time, and
round off the answer properly.


24.0   24.3   25.9   18.9   27.5   17.9   21.8   20.9   16.7   27.3
18.2   24.7   20.0   22.6   23.9   18.0   31.4   22.3   24.0   25.5
24.7   24.6   28.1   24.9   22.6   23.6   23.4   25.7   24.8   25.5
21.2   25.7   23.1   23.0   23.9   26.0   16.3   23.1   21.4   21.5
27.0   27.0   18.6   31.7   23.3   30.1   22.9   23.3   21.7   18.6


Solution: 


The sum of the travel times is 1,173.1. Divide the sum by 50 to calculate the mean value: 
23.462. Because each state’s travel time was measured to the nearest tenth, round this 
calculation to the nearest hundredth: 23.46. 


Exercise: 
Problem: 
Forbes magazine published data on the best small firms in 2012. These were firms that had
been publicly traded for at least a year, had a stock price of at least $5 per share, and had
reported annual revenue between $5 million and $1 billion. [link] shows the ages of the chief
executive officers for the first 60 ranked firms.


Age Frequency Relative frequency Cumulative relative frequency 
40-44 3 

45-49 11 

50-54 13 

55-59 16 

60-64 10 

65-69 6 

70-74 1 


a. What is the frequency for CEO ages between 54 and 65?
b. What percentage of CEOs are 65 years or older?
c. What is the relative frequency of ages under 50?
d. What is the cumulative relative frequency for CEOs younger than 55?
e. Which graph shows the relative frequency and which shows the cumulative relative
frequency?

[Figure: Graph A and Graph B, two bar graphs of CEOs' ages with vertical axes running from 0 to 1]


Use the following information to answer the next two exercises: [link] contains data on hurricanes
that have made direct hits on the U.S. between 1851 and 2004. A hurricane is given a strength
category rating based on the minimum wind speed generated by the storm.


Category   Number of direct hits   Relative frequency   Cumulative frequency
1          109                     0.3993               0.3993
2          72                      0.2637               0.6630
3          71                      0.2601
4          18                                           0.9890
5          3                       0.0110               1.0000
           Total = 273


Frequency of Hurricane Direct Hits 


Exercise: 


Problem: What is the relative frequency of direct hits that were category 4 hurricanes? 


a. 0.0768 


b. 0.0659 
c. 0.2601 
d. Not enough information to calculate 


Solution: 


b 
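
The arithmetic behind this answer can be checked with a short sketch. The dictionary restates the number of direct hits for each category; the variable names are illustrative only.

```python
# Hurricane category -> number of direct hits on the U.S., 1851 to 2004
direct_hits = {1: 109, 2: 72, 3: 71, 4: 18, 5: 3}
total = sum(direct_hits.values())

# Relative frequency of category 4 = category 4 hits / all direct hits
relative_cat4 = round(direct_hits[4] / total, 4)
print(total, relative_cat4)
```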
Exercise: 


Problem: 
What is the relative frequency of direct hits that were AT MOST a category 3 storm? 


a. 0.3480 
b. 0.9231 
c. 0.2601 
d. 0.3370 


Glossary 


Cumulative Relative Frequency 
The term applies to an ordered set of observations from smallest to largest. The cumulative 
relative frequency is the sum of the relative frequencies for all values that are less than or 
equal to the given value. 


Frequency 
the number of times a value of the data occurs 


Relative Frequency 
the ratio of the number of times a value of the data occurs in the set of all outcomes to the
total number of outcomes


Experimental Design and Ethics 


Does aspirin reduce the risk of heart attacks? Is one brand of fertilizer more 
effective at growing roses than another? Is fatigue as dangerous to a driver 
as the influence of alcohol? Questions like these are answered using 
randomized experiments. In this module, you will learn important aspects 
of experimental design. Proper study design ensures the production of 
reliable, accurate data. 


The purpose of an experiment is to investigate the relationship between two 
variables. When one variable causes change in another, we call the first 
variable the independent variable or explanatory variable. The affected
variable is called the dependent variable or response variable (think of stimulus and
response). In a randomized experiment, the researcher manipulates values of
the explanatory variable and measures the resulting changes in the response 
variable. The different values of the explanatory variable are called 
treatments. An experimental unit is a single object or individual to be 
measured. 


You want to investigate the effectiveness of vitamin E in preventing 
disease. You recruit a group of subjects and ask them if they regularly take 
vitamin E. You notice that the subjects who take vitamin E exhibit better 
health on average than those who do not. Does this prove that vitamin E is 
effective in disease prevention? It does not. There are many differences 
between the two groups compared in addition to vitamin E consumption. 
People who take vitamin E regularly often take other steps to improve their 
health: exercise, diet, other vitamin supplements, choosing not to smoke. 
Any one of these factors could be influencing health. As described, this 
study does not prove that vitamin E is the key to disease prevention. 


Additional variables that can cloud a study are called lurking variables. In 
order to prove that the explanatory variable is causing a change in the 
response variable, it is necessary to isolate the explanatory variable. The 
researcher must design her experiment in such a way that there is only one 
difference between groups being compared: the planned treatments. This is 
accomplished by the random assignment of experimental units to 
treatment groups. When subjects are assigned treatments randomly, all of 
the potential lurking variables are spread equally among the groups. At this
point the only difference between groups is the one imposed by the
researcher. Different outcomes measured in the response variable, therefore, 
must be a direct result of the different treatments. In this way, an 
experiment can prove a cause-and-effect connection between the 
explanatory and response variables. 
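
Random assignment as described above can be sketched in a few lines of Python. This is only an illustration, not a procedure given in the text: the function name, the twelve subject labels, and the two treatment names are hypothetical. The idea is to shuffle the experimental units and then deal them out to the treatment groups in turn.

```python
import random

def assign_randomly(units, treatments, seed=None):
    """Shuffle the experimental units, then deal them out to the treatments in turn."""
    rng = random.Random(seed)          # a seed makes the shuffle repeatable
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Hypothetical experimental units and treatments (not taken from the text).
subjects = [f"subject_{i}" for i in range(1, 13)]
groups = assign_randomly(subjects, ["vitamin E", "no vitamin E"], seed=1)
for name, members in groups.items():
    print(name, members)
```

Because chance alone decides which group each subject joins, the researcher's judgment plays no part in forming the groups.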


The power of suggestion can have an important influence on the outcome of 
an experiment. Studies have shown that the expectation of the study 
participant can be as important as the actual medication. In one study of 
performance-enhancing drugs, researchers noted: 


Results showed that believing one had taken the substance resulted in 
[performance] times almost as fast as those associated with consuming the 
drug itself. In contrast, taking the drug without knowledge yielded no 
significant performance increment. (McClung, M. Collins, D. “Because I 
know it will!”: placebo effects of an ergogenic aid on athletic performance. 
Journal of Sport & Exercise Psychology. 2007 Jun. 29(3):382-94. Web. 
April 30, 2013.) 


When participation in a study prompts a physical response from a 
participant, it is difficult to isolate the effects of the explanatory variable. To 
counter the power of suggestion, researchers set aside one treatment group 
as a control group. This group is given a placebo treatment—a treatment 
that cannot influence the response variable. The control group helps 
researchers balance the effects of being in an experiment with the effects of 
the active treatments. Of course, if you are participating in a study and you 
know that you are receiving a pill which contains no actual medication, then 
the power of suggestion is no longer a factor. Blinding in a randomized 
experiment preserves the power of suggestion. When a person involved in a 
research study is blinded, he does not know who is receiving the active 
treatment(s) and who is receiving the placebo treatment. A double-blind 
experiment is one in which both the subjects and the researchers involved 
with the subjects are blinded. 


Example: 
Exercise: 


Problem: 


The Smell & Taste Treatment and Research Foundation conducted a 
study to investigate whether smell can affect learning. Subjects 
completed mazes multiple times while wearing masks. They 
completed the pencil and paper mazes three times wearing floral- 
scented masks, and three times with unscented masks. Participants 
were assigned at random to wear the floral mask during the first three 
trials or during the last three trials. For each trial, researchers recorded 
the time it took to complete the maze and the subject’s impression of 
the mask’s scent: positive, negative, or neutral. 


a. Describe the explanatory and response variables in this study. 

b. What are the treatments? 

c. Identify any lurking variables that could interfere with this study. 
d. Is it possible to use blinding in this study? 


Solution: 


a. The explanatory variable is scent, and the response variable is 
the time it takes to complete the maze. 

b. There are two treatments: a floral-scented mask and an unscented 
mask. 

c. All subjects experienced both treatments. The order of treatments 
was randomly assigned so there were no differences between the 
treatment groups. Random assignment eliminates the problem of 
lurking variables. 

d. Subjects will clearly know whether they can smell flowers or 
not, so subjects cannot be blinded in this study. Researchers 
timing the mazes can be blinded, though. The researcher who is 
observing a subject will not know which mask is being worn. 
Chapter Review


A poorly designed study will not produce reliable data. There are certain 
key components that must be included in every experiment. To eliminate 
lurking variables, subjects must be assigned randomly to different treatment 
groups. One of the groups must act as a control group, demonstrating what 
happens when the active treatment is not applied. Participants in the control 
group receive a placebo treatment that looks exactly like the active 
treatments but cannot influence the response variable. To preserve the 
integrity of the placebo, both researchers and subjects may be blinded. 
When a study is designed properly, the only difference between treatment 
groups is the one imposed by the researcher. Therefore, when groups 
respond differently to different treatments, the difference must be due to the 
influence of the explanatory variable. 


“An ethics problem arises when you are considering an action that benefits
you or some cause you support, hurts or reduces benefits to others, and
violates some rule.” (Andrew Gelman, “Open Data and Open Methods,”
Ethics and Statistics,
http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics1.pdf
(accessed May 1, 2013).) Ethical violations in statistics are not always
easy to spot. Professional associations and federal agencies post guidelines 
for proper conduct. It is important that you learn basic statistical procedures 
so that you can recognize proper data analysis. 


Glossary 


Explanatory Variable 
the independent variable in an experiment; the value controlled by 
researchers 


Treatments 
different values or components of the explanatory variable applied in 
an experiment 


Response Variable 
the dependent variable in an experiment; the value that is measured 
for change at the end of an experiment 


Experimental Unit 
any individual or object to be measured 


Lurking Variable 
a variable that has an effect on a study even though it is neither an 
explanatory variable nor a response variable 


Random Assignment 
the act of organizing experimental units into treatment groups using 
random methods 


Control Group 
a group in a randomized experiment that receives an inactive treatment 
but is otherwise managed exactly as the other groups 


Informed Consent 
Any human subject in a research study must be cognizant of any risks
or costs associated with the study. The subject has the right to know
the nature of the treatments included in the study, their potential risks,
and their potential benefits. Consent must be given freely by an
informed, fit participant.


Institutional Review Board 
a committee tasked with oversight of research programs that involve 
human subjects 


Placebo 
an inactive treatment that has no real effect on the response variable


Blinding 
not telling participants which treatment a subject is receiving 


Double-blinding 
the act of blinding both the subjects of an experiment and the 
researchers who work with the subjects 


Introduction 
When you have large amounts of data, you will need to organize it in a way
that makes sense. These ballots from an election are rolled together with
similar ballots to keep them organized. (credit: William Greeson)


Once you have collected data, what will you do with it? Data can be 
described and presented in many different formats. For example, suppose 
you are interested in buying a house in a particular area. You may have no 
clue about the house prices, so you might ask your real estate agent to give 
you a sample data set of prices. Looking at all the prices in the sample often 
is overwhelming. A better way might be to look at the median price and the 
variation of prices. The median and variation are just two ways that you 
will learn to describe data. Your agent might also provide you with a graph 
of the data. 


In this chapter, you will study numerical and graphical ways to describe and 
display your data. This area of statistics is called "Descriptive Statistics." 
You will learn how to calculate, and even more importantly, how to 
interpret these measurements and graphs. 


A statistical graph is a tool that helps you learn about the shape or
distribution of a sample or a population. A graph can be a more effective
way of presenting data than a mass of numbers because we can see where
data clusters and where there are only a few data values. Newspapers and
the Internet use graphs to show trends and to enable readers to compare
facts and figures quickly. Statisticians often graph data first to get a picture
of the data. Then, more formal tools may be applied.


Some of the types of graphs that are used to summarize and organize data 
are the dot plot, the bar graph, the histogram, the stem-and-leaf plot, the 
frequency polygon (a type of broken line graph), the pie chart, and the box 
plot. In this chapter, we will briefly look at stem-and-leaf plots, line graphs, 
and bar graphs, as well as frequency polygons, and time series graphs. Our 
emphasis will be on histograms and box plots. 


Display Data 


Stem-and-Leaf Graphs (Stemplots), Line Graphs, and Bar Graphs 


One simple graph, the stem-and-leaf graph or stemplot, comes from the field of exploratory data analysis. It is a 
good choice when the data sets are small. To create the plot, divide each observation of data into a stem and a leaf. 
The leaf consists of a final significant digit. For example, 23 has stem two and leaf three. The number 432 has 
stem 43 and leaf two. Likewise, the number 5,432 has stem 543 and leaf two. The decimal 9.3 has stem nine and 
leaf three. Write the stems in a vertical line from smallest to largest. Draw a vertical line to the right of the stems. 
Then write the leaves in increasing order next to their corresponding stem. 
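
The steps above can be sketched in Python. This is a minimal illustration for whole-number data only; the function name and the sample values are hypothetical and do not come from the text. Each value is split into a stem (all digits but the last) and a leaf (the last digit), and every stem between the smallest and the largest is listed, even if it has no leaves.

```python
from collections import defaultdict

def stemplot(values):
    """Return the lines of a stem-and-leaf plot for whole-number data."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)     # 432 -> stem 43, leaf 2
        leaves[stem].append(leaf)
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>3} | {row}".rstrip())
    return lines

sample = [23, 25, 31, 31, 38, 40, 47, 62]   # hypothetical data
print("\n".join(stemplot(sample)))
```

Listing the empty stems keeps gaps in the data visible, which matters when looking for values that sit far from the rest.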


Example: 

For Susan Dean's spring pre-calculus class, scores for the first exam were as follows (smallest to largest): 

33; 42; 49; 49; 53; 55; 55; 61; 63; 67; 68; 68; 69; 69; 72; 73; 74; 78; 80; 83; 88; 88; 88; 90; 92; 94; 94; 94; 94; 96;
100


Stem Leaf
3 3
4 299
5 355
6 1378899
7 2348
8 03888
9 0244446
10 0


Stem-and-Leaf Graph 


The stemplot shows that most scores fell in the 60s, 70s, 80s, and 90s. Eight out of the 31 scores, or approximately
26% (8/31), were in the 90s or 100, a fairly high number of As.


Note: 
Try It 
Exercise: 


Problem: 


For the Park City basketball team, scores for the last 30 games were as follows (smallest to largest): 

32; 32; 33; 34; 38; 40; 42; 42; 43; 44; 46; 47; 47; 48; 48; 48; 49; 50; 50; 51; 52; 52; 52; 53; 54; 56; 57; 57;
60; 61

Construct a stem plot for the data. 


Solution: 
Stem Leaf 
3 22348 
4 022346778889 
5 00122234677 
6 01 


The stemplot is a quick way to graph data and gives an exact picture of the data. You want to look for an overall 
pattern and any outliers. An outlier is an observation of data that does not fit the rest of the data. It is sometimes 
called an extreme value. When you graph an outlier, it will appear not to fit the pattern of the graph. Some outliers 
are due to mistakes (for example, writing down 50 instead of 500) while others may indicate that something 
unusual is happening. It takes some background information to explain outliers, so we will cover them in more 
detail later. 


Example: 

The data are the distances (in kilometers) from a home to local supermarkets. Create a stemplot using the data: 
1.1; 1.5; 2.3; 2.5; 2.7; 3.2; 3.3; 3.3; 3.5; 3.8; 4.0; 4.2; 4.5; 4.5; 4.7; 4.8; 5.5; 5.6; 6.5; 6.7; 12.3

Exercise: 


Problem: Do the data seem to have any concentration of values? 
Note: 


NOTE 
The leaves are to the right of the decimal. 


Solution: 


The value 12.3 may be an outlier. Values appear to concentrate at three and four kilometers. 


Stem Leaf
1 15
2 357
3 23358
4 025578
5 56
6 57
7
8
9
10
11
12 3

Note: 

Try It 

Exercise: 

Problem: 


The following data show the distances (in miles) from the homes of off-campus statistics students to the 
college. Create a stem plot using the data and identify any outliers: 


0.5; 0.7; 1.1; 1.2; 1.2; 1.3; 1.3; 1.5; 1.5; 1.7; 1.7; 1.8; 1.9; 2.0; 2.2; 2.5; 2.6; 2.8; 2.8; 2.8; 3.5; 3.8;
4.4; 4.8; 4.9; 5.2; 5.5; 5.7; 5.8; 8.0


Solution: 
Stem Leaf
0 57
1 12233557789
2 0256888
3 58
4 489
5 2578
6
7
8 0


The value 8.0 may be an outlier. Values appear to concentrate at one and two miles. 


Example: 
Exercise: 


Problem: 


A side-by-side stem-and-leaf plot allows a comparison of the two data sets in two columns. In a side-by-side
stem-and-leaf plot, two sets of leaves share the same stem. The leaves are to the left and the right of the
stems. [link] and [link] show the ages of presidents at their inauguration and at their death. Construct a
side-by-side stem-and-leaf plot using this data.


Solution: 


     Ages at Inauguration   Stem   Ages at Death
                998777632 |  4  | 69
8777766655554444422111110 |  5  | 366778
               9854421110 |  6  | 003344567778
                          |  7  | 0011147889
                          |  8  | 01358
                          |  9  | 0033


President Age President Age President Age 


Washington 57 Lincoln 52 Hoover 54 
J. Adams 61 A. Johnson 56 F. Roosevelt 51 
Jefferson 57 Grant 46 Truman 60 
Madison 57 Hayes 54 Eisenhower 62 
Monroe 58 Garfield 49 Kennedy 43 
J. Q. Adams 57 Arthur 51 L. Johnson 55 
Jackson 61 Cleveland 47 Nixon 56 
Van Buren 54 B. Harrison 55 Ford 61 
W. H. Harrison 68 Cleveland 55 Carter 52 
Tyler 51 McKinley 54 Reagan 69 
Polk 49 T. Roosevelt 42 G.H.W. Bush 64 
Taylor 64 Taft 51 Clinton 47 
Fillmore 50 Wilson 56 G. W. Bush 54 
Pierce 48 Harding 55 Obama 47 
Buchanan 65 Coolidge 51 


Presidential Ages at Inauguration 


President Age President Age President Age
Washington 67 Lincoln 56 Hoover 90
J. Adams 90 A. Johnson 66 F. Roosevelt 63
Jefferson 83 Grant 63 Truman 88
Madison 85 Hayes 70 Eisenhower 78
Monroe 73 Garfield 49 Kennedy 46
J. Q. Adams 80 Arthur 56 L. Johnson 64
Jackson 78 Cleveland 71 Nixon 81
Van Buren 79 B. Harrison 67 Ford 93
W. H. Harrison 68 Cleveland 71 Reagan 93
Tyler 71 McKinley 58
Polk 53 T. Roosevelt 60
Taylor 65 Taft 72
Fillmore 74 Wilson 67
Pierce 64 Harding 57
Buchanan 77 Coolidge 60


Presidential Age at Death


Another type of graph that is useful for specific data values is a line graph. In the particular line graph shown in
[link], the x-axis (horizontal axis) consists of data values and the y-axis (vertical axis) consists of frequency
points. The frequency points are connected using line segments.


Example: 


In a survey, 40 mothers were asked how many times per week a teenager must be reminded to do his or her 
chores. The results are shown in [link] and in [link]. 


Number of times teenager is reminded Frequency
0 2
1 5
2 8
3 14


[Figure: line graph of frequency (y-axis) against number of times teenager is reminded (x-axis, 0 to 6)]


Note: 
Try It 
Exercise: 


Problem: 


In a survey, 40 people were asked how many times per year they had their car in the shop for repairs. The 
results are shown in [link]. Construct a line graph. 


Number of times in shop Frequency 
0 7 
1 10 
2 14 
3 9 
Solution: 
[Figure: line graph of frequency (y-axis, 0 to 16) against number of times in shop (x-axis, 0 to 3)]


Bar graphs consist of bars that are separated from each other. The bars can be rectangles or they can be 
rectangular boxes (used in three-dimensional plots), and they can be vertical or horizontal. The bar graph shown 
in [link] has age groups represented on the x-axis and proportions on the y-axis. 


Example: 
Exercise: 


Problem: 
By the end of 2011, Facebook had over 146 million users in the United States. [link] shows three age groups,
the number of users in each age group, and the proportion (%) of users in each age group. Construct a bar
graph using this data.


Age groups Number of Facebook users Proportion (%) of Facebook users 
13-25 65,082,280 45% 
26-44 53,300,200 36% 


45-64 27,885,100 19% 


Solution: 
[Figure: bar graph of proportion (%) of Facebook users (y-axis) for the age groups 13-25, 26-44, and 45-64 (x-axis)]


Note: 
Try It 
Exercise: 


Problem: 
The population in Park City is made up of children, working-age adults, and retirees. [link] shows the three
age groups, the number of people in the town from each age group, and the proportion (%) of people in each
age group. Construct a bar graph showing the proportions.


Age groups Number of people Proportion of population
Children 67,059 19%
Working-age adults 152,198 43%
Retirees 131,662 38%


Solution:
[Figure: bar graph of proportion (%) (y-axis, 0% to 50%) for each age group (x-axis): Children, Working-age adults, Retirees]


Example:
Exercise:


Problem:



The columns in [link] contain: the race or ethnicity of students in U.S. public schools for the class of 2011,
percentages for the Advanced Placement examinee population for that class, and percentages for the overall
student population. Create a bar graph with the student race or ethnicity (qualitative data) on the x-axis, and
the Advanced Placement examinee population percentages on the y-axis.


Race/ethnicity AP examinee population Overall student population
1 = Asian, Asian American or Pacific Islander 10.3% 5.7%
2 = Black or African American 9.0% 14.7%
3 = Hispanic or Latino 17.0% 17.6%
4 = American Indian or Alaska Native 0.6% 1.1%
5 = White 57.1% 59.2%
6 = Not reported/other 6.0% 1.7%


Solution: 


[Figure: bar graph of percent of AP examinees (y-axis) for race/ethnicity categories 1 through 6 (x-axis)]


Note: 
Try It 
Exercise: 


Problem: 
Park City is broken down into six voting districts. The table shows the percent of the total registered voter
population that lives in each district as well as the percent of the entire population that lives in each
district. Construct a bar graph that shows the registered voter population by district.


District Registered voter population Overall city population 
1 15.5% 19.4% 

2 12.2% 15.6% 

3 9.8% 9.0% 

4 17.4% 18.5% 

5 22.8% 20.7% 

6 22.3% 16.8% 


Solution: 


[Figure: bar graph of voter proportion (%) (y-axis, 0.0% to 25.0%) for each district (x-axis)]


Example: 
Exercise: 


Problem: Below is a two-way table showing the types of pets owned by men and women: 


Dogs Cats Fish Total 
Men 4 2 2 8 
Women 4 6 2 12 
Total 8 8 4 20 


Given these data, calculate the conditional distribution of pet type for the subpopulation of men.
Solution:

Men who own dogs = 4/8 = 0.5

Men who own cats = 2/8 = 0.25

Men who own fish = 2/8 = 0.25


Note: The proportions in a conditional distribution must sum to one. In this case, 0.5 + 0.25 + 0.25 = 1, so
the solution checks.
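
The same calculation can be sketched in Python using the counts from the two-way table above. The dictionary layout and the function name are assumptions made for illustration; exact fractions are used so that the check that the proportions sum to one is not affected by rounding.

```python
from fractions import Fraction

# Counts from the two-way table of pet ownership.
table = {
    "Men":   {"Dogs": 4, "Cats": 2, "Fish": 2},
    "Women": {"Dogs": 4, "Cats": 6, "Fish": 2},
}

def conditional_distribution(row):
    """Divide each count in a row by that row's total."""
    total = sum(row.values())
    return {pet: Fraction(count, total) for pet, count in row.items()}

men = conditional_distribution(table["Men"])
for pet, share in men.items():
    print(f"Men who own {pet.lower()}: {share} = {float(share)}")

# The proportions of a conditional distribution sum to one.
print(sum(men.values()) == 1)
```

Calling conditional_distribution(table["Women"]) gives the corresponding distribution for the subpopulation of women.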


Histograms, Frequency Polygons, and Time Series Graphs 


For most of the work you do in this book, you will use a histogram to display the data. One advantage of a 
histogram is that it can readily display large data sets. A rule of thumb is to use a histogram when the data set 
consists of 100 values or more. 


A histogram consists of contiguous (adjoining) boxes. It has both a horizontal axis and a vertical axis. The
horizontal axis is labeled with what the data represents (for instance, distance from your home to school). The
vertical axis is labeled either frequency or relative frequency (or percent frequency or probability). The graph
will have the same shape with either label. The histogram (like the stemplot) can give you the shape of the data,
the center, and the spread of the data.


The relative frequency is equal to the frequency for an observed value of the data divided by the total number of
data values in the sample. (Remember, frequency is defined as the number of times an answer occurs.) If:


• f = frequency
• n = total number of data values (or the sum of the individual frequencies), and
• RF = relative frequency,


then:
Equation:


RF = \frac{f}{n}


For example, if three students in Mr. Ahab's English class of 40 students received from 90% to 100%, then f = 3, n
= 40, and RF = f/n = 3/40 = 0.075. So 7.5% of the students received 90% to 100%. Scores of 90% to 100% are quantitative measures.
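
As a small worked sketch in Python, the relative frequency is each frequency divided by the total, and a running total of the relative frequencies gives the cumulative relative frequency. Only f = 3 and n = 40 come from the example above; the other three frequencies are hypothetical values chosen so that the class total is 40.

```python
from itertools import accumulate

def relative_frequencies(freqs):
    """Divide each frequency f by n, the sum of all the frequencies."""
    n = sum(freqs)
    return [f / n for f in freqs]

# Hypothetical frequency table for the class; only the final 3 is from the text.
freqs = [5, 12, 20, 3]
rf = relative_frequencies(freqs)
crf = list(accumulate(rf))   # running total of the relative frequencies

for f, r, c in zip(freqs, rf, crf):
    print(f, round(r, 3), round(c, 3))
```

The last cumulative relative frequency is always one, because every data value has been counted by then.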


To construct a histogram, first decide how many bars or intervals, also called classes, will represent the data. Many
histograms consist of five to 15 bars or classes for clarity. Next, choose a
starting point for the first interval that is less than the smallest data value. A convenient starting point is a lower
value carried out to one more decimal place than the value with the most decimal places. For example, if the value
with the most decimal places is 6.1 and this is the smallest value, a convenient starting point is 6.05 (6.1 - 0.05 =
6.05). We say that 6.05 has more precision. If the value with the most decimal places is 2.23 and the lowest value
is 1.5, a convenient starting point is 1.495 (1.5 - 0.005 = 1.495). If the value with the most decimal places is 3.234
and the lowest value is 1.0, a convenient starting point is 0.9995 (1.0 - 0.0005 = 0.9995). If all the data happen to
be integers and the smallest value is two, then a convenient starting point is 1.5 (2 - 0.5 = 1.5). When the
starting point and other boundaries are carried to one additional decimal place, no data value will fall on a
boundary. The next two examples go into detail about how to construct a histogram using continuous data and how
to create a histogram using discrete data.
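
The rule for a convenient starting point can be sketched in Python. This is one possible reading of the rule, not code from the text: the function names are hypothetical, and the data are passed as strings so that the number of decimal places written (for example, the one decimal place in 1.0) is preserved.

```python
from decimal import Decimal

def decimal_places(text):
    """Number of digits written after the decimal point, e.g. "2.23" -> 2."""
    return max(0, -Decimal(text).as_tuple().exponent)

def convenient_start(values):
    """Smallest value minus 5 in the next decimal place beyond the data."""
    places = max(decimal_places(v) for v in values)
    offset = Decimal(5).scaleb(-(places + 1))   # 0.05 when the data have one decimal
    return min(Decimal(v) for v in values) - offset

print(convenient_start(["6.1", "7.3"]))     # 6.05
print(convenient_start(["2.23", "1.5"]))    # 1.495
print(convenient_start(["3.234", "1.0"]))   # 0.9995
print(convenient_start(["2", "5", "9"]))    # 1.5
```

The values 7.3, 5, and 9 are filler added so that each list has more than one entry; the smallest value and the value with the most decimal places match the cases in the paragraph above.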


Example: 

The following data are the heights (in inches to the nearest half inch) of 100 male semiprofessional soccer players. 
The heights are continuous data, since height is measured. 

60; 60.5; 61; 61; 61.5 

63.5; 63.5; 63.5

64; 64; 64; 64; 64; 64; 64; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5 

66; 66; 66; 66; 66; 66; 66; 66; 66; 66; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 67; 67; 67; 
74

The smallest data value is 60. Since the data with the most decimal places has one decimal (for instance, 61.5), we
want our starting point to have two decimal places. Since the numbers 0.5, 0.05, 0.005, etc. are convenient
numbers, use 0.05 and subtract it from 60, the smallest value, for the convenient starting point.

60 - 0.05 = 59.95, which is more precise than, say, 61.5 by one decimal place. The starting point is, then, 59.95.
The largest value is 74, so 74 + 0.05 = 74.05 is the ending value.

Next, calculate the width of each bar or class interval. To calculate this width, subtract the starting point from the 
ending value and divide by the number of bars (you must choose the number of bars you desire). Suppose you 
choose eight bars. 

Equation:


\frac{74.05 - 59.95}{8} = 1.76


Note: 

NOTE 

We will round up to two and make each bar or class interval two units wide. Rounding up to two is one way to
prevent a value from falling on a boundary. Rounding to the next number is often necessary even if it goes
against the standard rules of rounding. For this example, using 1.76 as the width would also work. A guideline
that some follow for the number of bars or class intervals is to take the square root of the number of data
values and then round to the nearest whole number, if necessary. For example, if there are 150 values of data,
take the square root of 150 and round to 12 bars or intervals.
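
The arithmetic in this example can be checked with a short Python sketch. The function name is hypothetical; the numbers are the starting point, ending value, and number of bars used above, plus the 150 data values from the square-root guideline.

```python
import math

def class_width(start, end, bars):
    """Width of each class interval: (ending value - starting point) / number of bars."""
    return (end - start) / bars

width = class_width(59.95, 74.05, 8)
print(round(width, 2))         # width to two decimal places
print(math.ceil(width))        # rounded up to the next whole number

# Square-root guideline for the number of bars with 150 data values.
print(round(math.sqrt(150)))
```

Using math.ceil rather than round reflects the note's point that the width is rounded up even when ordinary rounding would also give the same result.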


The boundaries are: 


• 59.95
• 59.95 + 2 = 61.95
• 61.95 + 2 = 63.95
• 63.95 + 2 = 65.95
• 65.95 + 2 = 67.95
• 67.95 + 2 = 69.95
• 69.95 + 2 = 71.95
• 71.95 + 2 = 73.95
• 73.95 + 2 = 75.95


The heights 60 through 61.5 inches are in the interval 59.95-61.95. The heights that are 63.5 are in the interval
61.95-63.95. The heights that are 64 through 64.5 are in the interval 63.95-65.95. The heights 66 through 67.5 are
in the interval 65.95-67.95. The heights 68 through 69.5 are in the interval 67.95-69.95. The heights 70 through
71 are in the interval 69.95-71.95. The heights 72 through 73.5 are in the interval 71.95-73.95. The height 74 is
in the interval 73.95-75.95.
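
The boundaries and the interval that contains a given height can be generated with a short Python sketch. The function names are hypothetical; the starting point 59.95, the width 2, and the eight bars are the values chosen in this example. Decimal arithmetic is used so the boundaries print exactly as written above.

```python
from decimal import Decimal

def class_boundaries(start, width, bars):
    """Starting point followed by one boundary per bar, each `width` apart."""
    start, width = Decimal(start), Decimal(width)
    return [start + i * width for i in range(bars + 1)]

def interval_of(value, boundaries):
    """Return the (lower, upper) boundaries of the class containing value."""
    value = Decimal(value)
    for low, high in zip(boundaries, boundaries[1:]):
        if low < value < high:
            return low, high
    raise ValueError("value lies outside the boundaries or on a boundary")

bounds = class_boundaries("59.95", "2", 8)
print([str(b) for b in bounds])
print(interval_of("63.5", bounds))
print(interval_of("74", bounds))
```

Because the boundaries carry one more decimal place than the data, no height measured to the nearest half inch can land exactly on a boundary.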


The following histogram displays the heights on the x-axis and relative frequency on the y-axis. 
[Figure: histogram of heights (x-axis) against relative frequency (y-axis)]


Note: 
Try It 
Exercise: 


Problem: 


The following data are the shoe sizes of 50 male students. The sizes are continuous data since shoe size is 
measured. Construct a histogram and calculate the width of each bar or class interval. Suppose you choose 
six bars. 
Solution:

Smallest value: 9 

Largest value: 14 

Convenient starting value: 9 - 0.05 = 8.95


Convenient ending value: 14 + 0.05 = 14.05


\frac{14.05 - 8.95}{6} = 0.85


The calculation suggests using 0.85 as the width of each bar or class interval. You can also use an interval
with a width equal to one.


Example: 
Create a histogram for the following data: the number of books bought by 50 part-time college students at ABC 
College. The number of books is discrete data, since books are counted. 
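1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1
2; 2; 2; 2; 2; 2; 2; 2; 2; 2
3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3
4; 4; 4; 4; 4; 4
5; 5; 5; 5; 5
6; 6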


Eleven students buy one book. Ten students buy two books. Sixteen students buy three books. Six students buy 
four books. Five students buy five books. Two students buy six books. 

Because the data are integers, subtract 0.5 from 1, the smallest data value and add 0.5 to 6, the largest data value. 
Then the starting point is 0.5 and the ending value is 6.5. 

Exercise: 


Problem: 


Next, calculate the width of each bar or class interval. If the data are discrete and there are not too many
different values, a width that places the data values in the middle of the bar or class interval is the most
convenient. Since the data consist of the numbers 1, 2, 3, 4, 5, 6, and the starting point is 0.5, a width of one
places the 1 in the middle of the interval from 0.5 to 1.5, the 2 in the middle of the interval from 1.5 to 2.5,
the 3 in the middle of the interval from 2.5 to 3.5, the 4 in the middle of the interval from _______ to
_______, the 5 in the middle of the interval from _______ to _______, and the _______ in the middle of the
interval from _______ to _______.
Solution:
• 3.5 to 4.5
• 4.5 to 5.5
• 6
• 5.5 to 6.5


Calculate the number of bars as follows: 
Equation:
\frac{6.5 - 0.5}{\text{number of bars}} = 1
where 1 is the width of a bar. Therefore, bars = 6.


The following histogram displays the number of books on the x-axis and the frequency on the y-axis. 
[Figure: histogram of frequency (y-axis) against number of books (x-axis, with boundaries 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5)]


Example: 
Exercise: 


Problem: Using this data set, construct a histogram. 


Number of hours my classmates spent playing video games on weekends 


9.95 10 2.25 16.75 0
19.5 22.5 7.5 15 12.75
5.5 11 10 20.75 17.5
23 21.9 24 23.75 18
20 15 22.9 18.8 20.5


Solution: 


[Figure: histogram titled "Hours Spent Playing Video Games on Weekends", with number of hours on the x-axis (0 to 25) and number of students on the y-axis]


Some values in this data set fall on boundaries for the class intervals. A value is counted in a class interval if 
it falls on the left boundary, but not if it falls on the right boundary. Different researchers may set up 
histograms for the same data in different ways. There is more than one correct way to set up a histogram. 
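
The counting convention described here, where a value on a boundary belongs to the class on its right, can be sketched in Python. The function name and the short list of hours are illustrative only; the list is not the full data set from the problem.

```python
from bisect import bisect_right

def count_by_class(values, boundaries):
    """Count values per class, treating each class as [left boundary, right boundary)."""
    counts = [0] * (len(boundaries) - 1)
    for v in values:
        if boundaries[0] <= v < boundaries[-1]:
            # bisect_right places a value equal to a boundary in the class to its right
            counts[bisect_right(boundaries, v) - 1] += 1
    return counts

hours = [0, 4.5, 5, 9.95, 10, 10, 15, 19.5, 20]   # a few illustrative values
print(count_by_class(hours, [0, 5, 10, 15, 20, 25]))
```

Here the value 10 is counted in the class from 10 to 15, not in the class from 5 to 10. A researcher who chose the opposite convention would get a different, equally valid histogram.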


Frequency Polygons 


Frequency polygons are analogous to line graphs, and just as line graphs make continuous data visually easy to 
interpret, so too do frequency polygons. 


To construct a frequency polygon, first examine the data and decide on the number of intervals, or class intervals, 
to use on the x-axis and y-axis. After choosing the appropriate ranges, begin plotting the data points. After all the 
points are plotted, draw line segments to connect them. 


Example: 
A frequency polygon was constructed from the frequency table below. 


Frequency distribution for calculus final test scores 


Lower bound Upper bound Frequency Cumulative frequency 
49.5 59.5 5 5 

59.5 69.5 10 15 

69.5 79.5 30 45 

79.5 89.5 40 85 


89.5 99.5 15 100 


[Figure: frequency polygon titled "Test Scores", with frequency on the y-axis and scores on the x-axis (labels 44.5, 54.5, 64.5, 74.5, 84.5, 94.5, 104.5)]

The first label on the x-axis is 44.5. This represents an interval extending from 39.5 to 49.5. Since the lowest test 
score is 54.5, this interval is used only to allow the graph to touch the x-axis. The point labeled 54.5 represents the 
next interval, or the first “real” interval from the table, and contains five scores. This reasoning is followed for 
each of the remaining intervals with the point 104.5 representing the interval from 99.5 to 109.5. Again, this 
interval contains no data and is only used so that the graph will touch the x-axis. Looking at the graph, we say that 
this distribution is skewed because one side of the graph does not mirror the other side. 
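
The points of this frequency polygon can be computed with a short Python sketch using the bounds and frequencies from the table above. The variable names are assumptions. Each point sits at the midpoint of its class, and one empty class is added at each end so that the polygon touches the x-axis.

```python
# Class bounds and frequencies from the calculus final test score table.
lower = [49.5, 59.5, 69.5, 79.5, 89.5]
upper = [59.5, 69.5, 79.5, 89.5, 99.5]
freq = [5, 10, 30, 40, 15]

width = upper[0] - lower[0]
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower, upper)]

# Pad with an empty class on each side so the graph returns to zero.
points = (
    [(midpoints[0] - width, 0)]
    + list(zip(midpoints, freq))
    + [(midpoints[-1] + width, 0)]
)

for x, y in points:
    print(x, y)
```

Connecting these points in order with line segments gives the polygon; the x-values match the labels on the horizontal axis of the graph.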


Note: 
Try It 
Exercise: 


Problem: Construct a frequency polygon of U.S. Presidents’ ages at inauguration shown in [link]. 


Age at inauguration Frequency 
41.5-46.5 4 
46.5-51.5 11 
51.5-56.5 14 
56.5-61.5 9
61.5-66.5 4 
66.5-71.5 2 
Solution: 


The first label on the x-axis is 39. This represents an interval extending from 36.5 to 41.5. Since there are no 
ages less than 41.5, this interval is used only to allow the graph to touch the x-axis. The point labeled 44 
represents the next interval, or the first “real” interval from the table, and contains four ages. This
reasoning is followed for each of the remaining intervals with the point 74 representing the interval from 
71.5 to 76.5. Again, this interval contains no data and is only used so that the graph will touch the x-axis. 
Looking at the graph, we say that this distribution is skewed because one side of the graph does not mirror 
the other side. 


[Figure: frequency polygon titled "President’s Age at Inauguration". Vertical axis: Frequency.]


Frequency polygons are useful for comparing distributions. This is achieved by overlaying the frequency polygons 
drawn for different data sets. 


Example: 


We will construct an overlay frequency polygon comparing the scores from [link] with the students’ final numeric
grade.


Frequency distribution for calculus final test scores


Lower bound   Upper bound   Frequency   Cumulative frequency
49.5          59.5          5           5
59.5          69.5          10          15
69.5          79.5          30          45
79.5          89.5          40          85
89.5          99.5          15          100


Frequency distribution for calculus final grades


Lower bound   Upper bound   Frequency   Cumulative frequency
49.5          59.5          10          10
59.5          69.5          10          20
69.5          79.5          30          50
79.5          89.5          45          95
89.5          99.5          5           100


[Figure: overlaid frequency polygons titled "Final Test Grade v Final Grade". Horizontal axis: Grades, labeled 44.5, 54.5, 64.5, 74.5, 84.5, 94.5, 104.5. Vertical axis: Frequency.]


Constructing a Time Series Graph 


Suppose that we want to study the temperature range of a region for an entire month. Every day at noon we note 
the temperature and write this down in a log. A variety of statistical studies could be done with these data. We 
could find the mean or the median temperature for the month. We could construct a histogram displaying the 
number of days that temperatures reach a certain range of values. However, all of these methods ignore a portion 
of the data that we have collected. 


One feature of the data that we may want to consider is that of time. Since each date is paired with the temperature 
reading for the day, we don't have to think of the data as being random. We can instead use the times given to
impose a chronological order on the data. A graph that recognizes this ordering and displays the changing 
temperature as the month progresses is called a time series graph. 


To construct a time series graph, we must look at both pieces of our paired data set. We start with a standard 
Cartesian coordinate system. The horizontal axis is used to plot the date or time increments, and the vertical axis is 
used to plot the values of the variable that we are measuring. By doing this, we make each point on the graph 
correspond to a date and a measured quantity. The points on the graph are typically connected by straight lines in 
the order in which they occur. 
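
As an illustration of these steps, here is a minimal Python sketch. The temperature readings, dates, and the function name `time_series` are hypothetical and are not taken from the text; the sketch only shows how paired data are put in chronological order before the points are connected.

```python
from datetime import date

# Hypothetical noon temperatures (degrees Celsius), logged out of order.
readings = [
    (date(2015, 3, 3), 4.5),
    (date(2015, 3, 1), 2.0),
    (date(2015, 3, 4), 6.0),
    (date(2015, 3, 2), 3.5),
]


def time_series(pairs):
    """Sort (date, value) pairs chronologically so they can be plotted in order."""
    return sorted(pairs, key=lambda pair: pair[0])


series = time_series(readings)
x_values = [d for d, _ in series]   # horizontal axis: dates
y_values = [t for _, t in series]   # vertical axis: measured values

# Each consecutive pair of points is joined by one straight line segment.
segments = list(zip(series, series[1:]))
print(y_values)
print(len(segments), "line segments")
```

With four readings there are three line segments, one between each pair of consecutive dates.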


Example: 
Exercise: 


Problem: 


The following table shows the Consumer Price Index for each month, together with the annual value, for ten years.
Construct a time series graph for the Annual Consumer Price Index data only.


Year   Jan      Feb      Mar      Apr      May      Jun      Jul
2003   181.7    183.1    184.2    183.8    183.5    183.7    183.9
2004   185.2    186.2    187.4    188.0    189.1    189.7    189.4
2005   190.7    191.8    193.3    194.6    194.4    194.5    195.4
2006   198.3    198.7    199.8    201.5    202.5    202.9    203.5
2007   202.416  203.499  205.352  206.686  207.949  208.352  208.299
2008   211.080  211.693  213.528  214.823  216.632  218.815  219.964
2009   211.143  212.193  212.709  213.240  213.856  215.693  215.351
2010   216.687  216.741  217.631  218.009  218.178  217.965  218.011
2011   220.223  221.309  223.467  224.906  225.964  225.722  225.922
2012   226.665  227.663  229.392  230.085  229.815  229.478  229.104


Year   Aug      Sep      Oct      Nov      Dec      Annual
2003   184.6    185.2    185.0    184.5    184.3    184.0
2004   189.5    189.9    190.9    191.0    190.3    188.9
2005   196.4    198.8    199.2    197.6    196.8    195.3
2006   203.9    202.9    201.8    201.5    201.8    201.6
2007   207.917  208.490  208.936  210.177  210.036  207.342
2008   219.086  218.783  216.573  212.425  210.228  215.303
2009   215.834  215.969  216.177  216.330  215.949  214.537
2010   218.312  218.439  218.711  218.803  219.179  218.056
2011   226.545  226.889  226.421  226.230  225.672  224.939
2012   230.379  231.407  231.317  230.221  229.601  229.594


Solution:


[Figure: time series graph titled "Annual CPI". Horizontal axis: Year. Vertical axis: Annual consumer price index.]


Note: 
Try It 
Exercise: 


Problem: 


The following table is a portion of a data set from www.worldbank.org. Use the table to construct a time
series graph for CO2 emissions for the United States.


CO2 emissions


Year Ukraine United Kingdom United States 
2003 352,259 540,640 5,681,664 
2004 343,121 540,409 5,790,761 
2005 339,029 541,990 5,826,394 
2006 327,797 542,045 5,737,615
2007 328,357 528,631 5,828,697 
2008 323,657 522,247 5,656,839 
2009 272,176 474,579 5,299,563 
Solution 


[Figure: time series graph titled "US CO2 Emissions". Horizontal axis: years 2003 through 2009. Vertical axis: CO2 emissions in kt (millions).]


Uses of a Time Series Graph 


Time series graphs are important tools in various applications of statistics. When recording values of the same 
variable over an extended period of time, sometimes it is difficult to discern any trend or pattern. However, once 
the same data points are displayed graphically, some features jump out. Time series graphs make trends easy to 
spot. 
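
A trend that is easy to see on a graph can also be checked numerically. The sketch below is an illustration only, not a method from the text: it takes the Annual column of the Consumer Price Index example and computes the change from each year to the next. The function name `yearly_changes` is hypothetical.

```python
# Annual CPI values from the Consumer Price Index example.
annual_cpi = {
    2003: 184.0, 2004: 188.9, 2005: 195.3, 2006: 201.6, 2007: 207.342,
    2008: 215.303, 2009: 214.537, 2010: 218.056, 2011: 224.939, 2012: 229.594,
}


def yearly_changes(series):
    """Return {year: change from the previous year} for a {year: value} mapping."""
    years = sorted(series)
    return {y: series[y] - series[prev] for prev, y in zip(years, years[1:])}


changes = yearly_changes(annual_cpi)
declines = [year for year, change in changes.items() if change < 0]
print("Years in which the annual CPI fell:", declines)
```

On a time series graph the same information appears as a single downward line segment in an otherwise rising line.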


How NOT to Lie with Statistics 


It is important to remember that the very reason we develop a variety of methods to present data is to develop 
insights into the subject of what the observations represent. We want to get a "sense" of the data. Are the
observations all very much alike, or are they spread across a wide range of values? Are they bunched at one end of
the spectrum, or are they distributed evenly? We are trying to get a visual picture of the numerical data.
Shortly we will develop formal mathematical measures of the data, but our visual graphical presentation can say 
much. It can, unfortunately, also say much that is distracting, confusing and simply wrong in terms of the 
impression the visual leaves. Many years ago Darrell Huff wrote the book How to Lie with Statistics. It has been 
through 25 plus printings and sold more than one and one-half million copies. His perspective was a harsh one and 
used many actual examples that were designed to mislead. He wanted to make people aware of such deception, but 
perhaps more importantly to educate so that others do not make the same errors inadvertently. 


Again, the goal is to enlighten with visuals that tell the story of the data. Pie charts have a number of common 
problems when used to convey the message of the data. Too many pieces of the pie overwhelm the reader. No more
than perhaps five or six categories should be shown if the chart is to give an idea of the relative importance of each piece. This is, after all,
the goal of a pie chart: to show which subset matters most relative to the others. If there are more components than this, then
perhaps an alternative approach would be better, or perhaps some can be consolidated into an "other" category. Pie
charts cannot show changes over time, although we see this attempted all too often. In federal, state, and city
finance documents, pie charts are often presented to show the components of revenue available to the governing
body for appropriation: income tax, sales tax, motor vehicle taxes, and so on. In and of itself this is interesting
information and can be nicely done with a pie chart. The error occurs when two years are set side-by-side. Because 
the total revenues change year to year, but the size of the pie is fixed, no real information is provided and the 
relative size of each piece of the pie cannot be meaningfully compared. 


Histograms can be very helpful in understanding the data. Properly presented, they can be a quick visual way to 
present probabilities of different categories by the simple visual of comparing relative areas in each category. Here 
the error, purposeful or not, is to vary the width of the categories. This of course makes comparison to the other 
categories impossible. It does embellish the importance of the category with the expanded width because it has a 
greater area, inappropriately, and thus visually "says" that that category has a higher probability of occurrence. 
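
The distortion described above can be shown with a small calculation. The Python sketch below uses hypothetical classes and a hypothetical function name, `bar_heights`. It illustrates one common remedy that this text does not cover: when class widths differ, plot frequency divided by class width, so that the area of each bar, rather than its height, matches the frequency.

```python
# Hypothetical classes: (lower, upper, frequency). The last class is twice as wide.
classes = [(0, 10, 20), (10, 20, 20), (20, 40, 20)]


def bar_heights(classes):
    """Height = frequency / class width, so each bar's area equals its frequency."""
    return [freq / (upper - lower) for lower, upper, freq in classes]


heights = bar_heights(classes)
areas = [h * (upper - lower) for h, (lower, upper, _) in zip(heights, classes)]

# Drawing all three bars with height 20 would give the wide class twice the area
# of the others, even though all three classes hold the same number of values.
print(heights)
print(areas)
```

Here all three classes contain 20 values, so a fair picture gives them equal area; the wide class is drawn half as tall.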


Time series graphs perhaps are the most abused. A plot of some variable across time should never be presented on 
axes that change part way across the page either in the vertical or horizontal dimension. Perhaps the time frame is 
changed from years to months. Perhaps this is to save space or because monthly data was not available for early 
years. In either case this confounds the presentation and destroys any value of the graph. If this is not done to 
purposefully confuse the reader, then it certainly is either lazy or sloppy work. 


Changing the units of measurement of the axis can smooth out a drop or accentuate one. If you want to show large
changes, then measure the variable in small units, pennies rather than thousands of dollars. And of course, to
continue the fraud, be sure that the axis does not begin at the origin (zero, zero). If it begins at the origin, then it becomes
apparent that the axis has been manipulated.


Perhaps you have a client that is concerned with the volatility of the portfolio you manage. An easy way to present 
the data is to use long time periods on the time series graph. Use months or better, quarters rather than daily or 
weekly data. If that doesn't get the volatility down then spread the time axis relative to the rate of return or 
portfolio valuation axis. If you want to show "quick" dramatic growth, then shrink the time axis. Any positive 
growth will show visually "high" growth rates. Do note that if the growth is negative then this trick will show the 
portfolio is collapsing at a dramatic rate. 


Again, the goal of descriptive statistics is to convey meaningful visuals that tell the story of the data. Purposeful
manipulation is fraudulent and unethical at worst, but even at best, making these types of errors will lead to
confusion on the part of the reader.


Chapter Review


A stem-and-leaf plot is a way to plot data and look at the distribution. In a stem-and-leaf plot, all data values 
within a class are visible. The advantage in a stem-and-leaf plot is that all values are listed, unlike a histogram, 
which gives classes of data values. A line graph is often used to represent a set of data values in which a quantity 
varies with time. These graphs are useful for finding trends. That is, finding a general pattern in data sets including 
temperature, sales, employment, company profit or cost over a period of time. A bar graph is a chart that uses 
either horizontal or vertical bars to show comparisons among categories. One axis of the chart shows the specific 
categories being compared, and the other axis represents a discrete value. Some bar graphs present bars clustered
in groups of more than one (grouped bar graphs), and others show the bars divided into subparts to show
cumulative effect (stacked bar graphs). Bar graphs are especially useful when categorical data is being used.


A histogram is a graphic version of a frequency distribution. The graph consists of bars of equal width drawn 
adjacent to each other. The horizontal scale represents classes of quantitative data values and the vertical scale 
represents frequencies. The heights of the bars correspond to frequency values. Histograms are typically used for 
large, continuous, quantitative data sets. A frequency polygon can also be used when graphing large data sets with
data points that repeat. The data values usually go on the x-axis, with the frequency graphed on the y-axis. Time series
graphs can be helpful when looking at large amounts of data for one variable over a period of time.


For the next three exercises, use the data to construct a line graph. 
Exercise: 


Problem: 


In a survey, 40 people were asked how many times they visited a store before making a major purchase. The
results are shown in [link].


Number of times in store Frequency 
1 4 
2 10 
3 16 
4 6 
5 4 
Solution: 
[Figure: line graph. Horizontal axis: Number of times in store, 1 to 5. Vertical axis: frequency, scaled from 0 to 18.]


Exercise: 


Problem: 


In a survey, several people were asked how many years it has been since they purchased a mattress. The
results are shown in [link].


Years since last purchase Frequency 


0 2 
1 8 
2 13 
3 22 
4 16 
5 9 
Exercise: 
Problem: 


Several children were asked how many TV shows they watch each day. The results of the survey are shown in
[link].


Number of TV shows Frequency 
0 12 
1 18 
2 36 
3 7 
4 2 
Solution: 
[Figure: line graph. Horizontal axis: TV shows watched per day, 0 to 4. Vertical axis: frequency, scaled from 0 to 40.]


Exercise: 


Problem: 


The students in Ms. Ramirez’s math class have birthdays in each of the four seasons. [link] shows the four 
seasons, the number of students who have birthdays in each season, and the percentage (%) of students in 
each group. Construct a bar graph showing the number of students. 


Seasons Number of students Proportion of population 
Spring 8 24% 
Summer 9 26% 
Autumn 11 32% 
Winter 6 18% 
Exercise: 
Problem: 


Using the data from Ms. Ramirez’s math class supplied in [link], construct a bar graph showing the
percentages. 


Solution 
[Figure: bar graph labeled "Birthdays in each season". Horizontal axis: Spring, Summer, Autumn, Winter. Vertical axis: percentage, scaled from 0% to 35%.]
Exercise: 
Problem: 


David County has six high schools. Each school sent students to participate in a county-wide science 
competition. [link] shows the percentage breakdown of competitors from each school, and the percentage of 
the entire student population of the county that goes to each school. Construct a bar graph that shows the 
population percentage of competitors from each school. 


High school Science competition population Overall student population


Alabaster 28.9% 8.6% 
Concordia 7.6% 23.2% 
Genoa 12.1% 15.0% 
Mocksville 18.5% 14.3% 
Tynneson 24.2% 10.1% 
West End 8.7% 28.8% 
Exercise: 
Problem: 


Use the data from the David County science competition supplied in [link]. Construct a bar graph that shows 
the county-wide population percentage of students at each school. 


Solution: 
[Figure: bar graph labeled "Students in science competition from each school". Horizontal axis: Alabaster, Concordia, Genoa, Mocksville, Tynneson, West End. Vertical axis: Proportion (%), scaled from 0.0% to 35.0%.]


Exercise: 


Problem: 


Sixty-five randomly selected car salespersons were asked the number of cars they generally sell in one week. 
Fourteen people answered that they generally sell three cars; nineteen generally sell four cars; twelve 
generally sell five cars; nine generally sell six cars; eleven generally sell seven cars. Complete the table. 


Data value (# cars) Frequency Relative frequency Cumulative relative frequency 


Exercise: 


Problem: What does the frequency column in [link] sum to? Why? 


Solution: 


65 


Exercise: 


Problem: What does the relative frequency column in [link] sum to? Why? 


Exercise: 


Problem: What is the difference between relative frequency and frequency for each data value in [link]? 


Solution: 
The relative frequency shows the proportion of data points that have each value. The frequency tells the 
number of data points that have each value. 
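
The distinction between frequency, relative frequency, and cumulative relative frequency can be made concrete with the car sales data from the exercises above. This is a minimal Python sketch offered as an illustration, not as the book's worked solution; the function name `frequency_table` is hypothetical, and exact fractions are used so the last cumulative value is exactly 1.

```python
from fractions import Fraction

# Cars sold per week -> number of salespersons giving that answer (65 in total).
frequencies = {3: 14, 4: 19, 5: 12, 6: 9, 7: 11}


def frequency_table(freqs):
    """Build rows of (value, frequency, relative frequency, cumulative relative frequency)."""
    total = sum(freqs.values())
    rows = []
    running = Fraction(0)
    for value in sorted(freqs):
        rel = Fraction(freqs[value], total)   # proportion of all data points
        running += rel                        # proportion at or below this value
        rows.append((value, freqs[value], rel, running))
    return rows


table = frequency_table(frequencies)
for value, freq, rel, cum in table:
    print(value, freq, round(float(rel), 4), round(float(cum), 4))
```

The frequency column sums to the sample size, the relative frequency column sums to 1, and the final cumulative relative frequency is 1.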

Exercise: 


Problem: 


What is the difference between cumulative relative frequency and relative frequency for each data value? 
Exercise: 

Problem: 

To construct the histogram for the data in [link], determine appropriate minimum and maximum x and y
values and the scaling. Sketch the histogram. Label the horizontal and vertical axes with words. Include
numerical scaling.


Solution: 


Answers will vary. One possible histogram is shown: 
[Figure: histogram. Horizontal axis: Number of cars sold, 3 to 8. Vertical axis: Frequency, scaled up to 20.]


Exercise: 


Problem: Construct a frequency polygon for the following: 


Pulse rates for women Frequency
60-69 12
70-79 14
80-89 11
90-99 1
100-109 1
110-119 0
120-129 1


Actual speed in a 30 MPH zone Frequency
42-45 25
46-49 14
50-53 7
54-57 3
58-61 1


Tar (mg) in nonfiltered cigarettes Frequency
10-13 1
14-17 0
18-21 15
22-25 7
26-29 2


Exercise:


Problem: 


Construct a frequency polygon from the frequency distribution for the 50 highest ranked countries for depth 
of hunger. 


Depth of hunger Frequency
230-259 21
260-289 13
290-319 5
320-349 7
350-379 1
380-409 1
410-439 1

Solution: 


Find the midpoint for each class. These will be graphed on the x-axis. The frequency values will be graphed
on the y-axis.


[Figure: frequency polygon titled "Depth of Hunger". Horizontal axis: Depth of hunger, with classes 230-259, 260-289, 290-319, 320-349, 350-379, 380-409, 410-439.]


Exercise: 
Problem: 
Use the two frequency tables to compare the life expectancy of men and women from 20 randomly selected
countries. Include an overlaid frequency polygon and discuss the shapes of the distributions, the center, the
spread, and any outliers. What can we conclude about the life expectancy of women compared to men?


Life expectancy at birth - women Frequency
49-55 3
56-62 3
63-69 1
70-76 3
77-83 8
84-90 2


Life expectancy at birth - men Frequency
49-55 3
56-62 3
63-69 1
70-76 1
77-83 7
84-90 5
Exercise: 
Problem: 


Construct a time series graph for (a) the number of male births, (b) the number of female births, and (c) the
total number of births. 


Sex/Year 1855 1856 1857 1858 1859 1860 1861 
Female 45,545 49,582 50,257 50,324 51,915 51,220 52,403 
Male 47,804 52,239 53,158 53,694 54,628 54,409 54,606 


Total 93,349 101,821 103,415 104,018 106,543 105,629 107,009 


Sex/Year 1862 1863 1864 1865 1866 1867 1868 
Female 51,812 53,115 54,959 54,850 55,307 55,527 56,292 
Male 55,257 56,226 57,374 58,220 58,360 58,517 59,222 
Total 107,069 109,341 112,333 113,070 113,667 114,044 115,514 
Sex/Year 1870 1871 1872 1873 1874 1875 
Female 56,431 56,099 57,472 58,233 60,109 60,146 
Male 58,959 60,029 61,293 61,467 63,602 63,432 
Total 115,390 116,128 118,765 119,700 123,711 123,578 
Solution: 


[Figure: time series graph titled "Births in Scotland". Horizontal axis: Year. Vertical axis: Number of births, scaled from 40,000 to 130,000. Three lines: both sexes, males, females.]


Exercise: 


Problem: 


The following data sets list full time police per 100,000 citizens along with homicides per 100,000 citizens for 
the city of Detroit, Michigan during the period from 1961 to 1973. 


Year       1961    1962    1963    1964    1965    1966    1967
Police     260.35  269.8   272.04  272.96  272.51  261.34  268.89
Homicides  8.6     8.9     8.52    8.89    13.07   14.57   21.36


Year       1968    1969    1970    1971    1972    1973
Police     295.99  319.87  341.43  356.59  376.69  390.19
Homicides  28.03   31.49   37.39   46.26   47.24   52.33


a. Construct a double time series graph using a common x-axis for both sets of data. 
b. Which variable increased the fastest? Explain. 
c. Did Detroit’s increase in police officers have an impact on the murder rate? Explain. 


Homework 


Exercise: 


Problem: [link] contains the 2010 obesity rates in U.S. states and Washington, DC. 


State            Percent (%)   State            Percent (%)   State            Percent (%)
Alabama          32.2          Kentucky         31.3          North Dakota     27.2
Alaska           24.5          Louisiana        31.0          Ohio             29.2
Arizona          24.3          Maine            26.8          Oklahoma         30.4
Arkansas         30.1          Maryland         27.1          Oregon           26.8
California       24.0          Massachusetts    23.0          Pennsylvania     28.6
Colorado         21.0          Michigan         30.9          Rhode Island     25.5
Connecticut      22.5          Minnesota        24.8          South Carolina   31.5
Delaware         28.0          Mississippi      34.0          South Dakota     27.3
Washington, DC   22.2          Missouri         30.5          Tennessee        30.8
Florida          26.6          Montana          23.0          Texas            31.0
Georgia          29.6          Nebraska         26.9          Utah             22.5
Hawaii           22.7          Nevada           22.4          Vermont          23.2
Idaho            26.5          New Hampshire    25.0          Virginia         26.0
Illinois         28.2          New Jersey       23.8          Washington       25.5
Indiana          29.6          New Mexico       25.1          West Virginia    32.5
Iowa             28.4          New York         23.9          Wisconsin        26.3
Kansas           29.4          North Carolina   27.8          Wyoming          25.1


a. Use a random number generator to randomly pick eight states. Construct a bar graph of the obesity rates
of those eight states.

b. Construct a bar graph for all the states beginning with the letter "A." 

c. Construct a bar graph for all the states beginning with the letter "M." 


Solution: 


a. Example solution for using the random number generator for the TI-84+ to generate a simple random 
sample of 8 states. Instructions are as follows. 


o Number the entries in the table 1-51 (includes Washington, DC; numbered vertically)
o Press MATH
o Arrow over to PRB
o Press 5:randInt(
o Enter 51,1,8)


Eight numbers are generated (use the right arrow key to scroll through the numbers). The numbers
correspond to the numbered states (for this example: {47 21 9 23 51 13 25 4}). If any numbers are
repeated, generate a different number by using 5:randInt(51,1). Here, the states (and Washington, DC)
are {Arkansas, Washington DC, Idaho, Maryland, Michigan, Mississippi, Virginia, Wyoming}.


Corresponding percents are {30.1, 22.2, 26.5, 27.1, 30.9, 34.0, 26.0, 25.1}. 
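
For readers without a TI-84+, the same kind of selection can be sketched in Python. This is an illustration only and is not part of the original solution; the function name `pick_states` and the `seed` parameter are hypothetical. Unlike randInt, the standard library function `random.sample` draws without replacement, so no repeated numbers need to be redrawn.

```python
import random


def pick_states(n_entries=51, sample_size=8, seed=None):
    """Pick distinct entry numbers from 1..n_entries (a simple random sample)."""
    rng = random.Random(seed)   # a seed makes the draw repeatable; omit it for a fresh draw
    return rng.sample(range(1, n_entries + 1), sample_size)


picks = pick_states(seed=1)
print(sorted(picks))
```

Each number is then matched to the state with that position in the table, numbered vertically as in the calculator instructions.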
[Figures: bar graphs of the obesity rates for parts a, b, and c. Vertical axis: Percent (%). The graph for part b shows Alabama, Alaska, Arizona, and Arkansas.]


Exercise: 
Problem: 
Suppose that three book publishers were interested in the number of fiction paperbacks adult consumers
purchase per month. Each publisher conducted a survey. In the survey, adult consumers were asked the
number of fiction paperbacks they had purchased the previous month. The results are as follows:


# of books Freq. Rel. freq.
0 10
1 12
2 16
3 12
4 8
5 6
6 2
8 2
Publisher A


# of books Freq. Rel. freq.
0 18
1 24
2 24
3 22
4 15
5 10
7 5
9 1
Publisher B


# of books Freq. Rel. freq.
0-1 20
2-3 35
4-5 12
6-7 2
8-9 1
Publisher C


a. Find the relative frequencies for each survey. Write them in the charts. 

b. Use the frequency column to construct a histogram for each publisher's survey. For Publishers A and B, 
make bar widths of one. For Publisher C, make bar widths of two. 

c. In complete sentences, give two reasons why the graphs for Publishers A and B are not identical. 

d. Would you have expected the graph for Publisher C to look like the other two graphs? Why or why not? 

e. Make new histograms for Publisher A and Publisher B. This time, make bar widths of two. 

f. Now, compare the graph for Publisher C to the new graphs for Publishers A and B. Are the graphs more 
similar or more different? Explain your answer. 


Exercise: 


Problem: 


Often, cruise ships conduct all on-board transactions, with the exception of gambling, on a cashless basis. At 
the end of the cruise, guests pay one bill that covers all onboard transactions. Suppose that 60 single travelers 
and 70 couples were surveyed as to their on-board bills for a seven-day cruise from Los Angeles to the 
Mexican Riviera. Following is a summary of the bills for each group. 


Amount($) Frequency Rel. frequency 


51-100 5 
101-150 10 
151-200 15 
201-250 15 
251-300 10 
301-350 5 
Singles 
Amount($) Frequency Rel. frequency 
100-150 5 
201-250 5 
251-300 5 
301-350 5 
351-400 10 
401-450 10 
451-500 10 
501-550 10 
551-600 5 
601-650 5 
Couples 


a. Fill in the relative frequency for each group. 

b. Construct a histogram for the singles group. Scale the x-axis by $50 widths. Use relative frequency on 
the y-axis. 

c. Construct a histogram for the couples group. Scale the x-axis by $50 widths. Use relative frequency on 
the y-axis. 

d. Compare the two graphs: 


i. List two similarities between the graphs. 
ii. List two differences between the graphs. 
iii. Overall, are the graphs more similar or different? 


e. Construct a new graph for the couples by hand. Since each couple is paying for two individuals, instead 
of scaling the x-axis by $50, scale it by $100. Use relative frequency on the y-axis. 
f. Compare the graph for the singles with the new graph for the couples: 


i. List two similarities between the graphs. 
ii. Overall, are the graphs more similar or different? 


g. How did scaling the couples graph differently change the way you compared it to the singles graph? 
h. Based on the graphs, do you think that individuals spend the same amount, more or less, as singles as 
they do person by person as a couple? Explain why in one or two complete sentences. 


Solution: 
Amount($) Frequency Relative frequency 
51-100 5 0.08 
101-150 10 0.17 
151-200 15 0.25 
201-250 15 0.25 
251-300 10 0.17 
301-350 5 0.08 
Singles 
Amount($) Frequency Relative frequency
100-150 5 0.07
201-250 5 0.07
251-300 5 0.07
301-350 5 0.07
351-400 10 0.14
401-450 10 0.14
451-500 10 0.14
501-550 10 0.14
551-600 5 0.07
601-650 5 0.07
Couples


a. See [link] and [link]. 

b. In the following histogram data values that fall on the right boundary are counted in the class interval, 
while values that fall on the left boundary are not counted (with the exception of the first interval where 
both boundary values are included). 

[Figure: histogram titled "Onboard Charges for Singles, 7-Day Cruise Sailing to the Mexican Riviera from LA". Horizontal axis: Amount ($), 50 to 350. Vertical axis: Relative frequency.]
c. In the following histogram, the data values that fall on the right boundary are counted in the class
interval, while values that fall on the left boundary are not counted (with the exception of the first
interval where values on both boundaries are included).

[Figure: histogram titled "Onboard Charges for Singles, 7-Day Cruise Sailing to the Mexican Riviera from LA". Horizontal axis: Amount ($), 100 to 650. Vertical axis: Relative Frequency.]


d. Compare the two graphs: 


i. Answers may vary. Possible answers include: 


o Both graphs have a single peak.
o Both graphs use class intervals with width equal to $50.


ii. Answers may vary. Possible answers include: 


o The couples graph has a class interval with no values.
o It takes almost twice as many class intervals to display the data for couples.


iii. Answers may vary. Possible answers include: The graphs are more similar than different because 
the overall patterns for the graphs are the same. 


e. Check student's solution. 
f. Compare the graph for the Singles with the new graph for the Couples: 


i. o Both graphs have a single peak.
   o Both graphs display 6 class intervals.
   o Both graphs show the same general pattern.


ii. Answers may vary. Possible answers include: Although the width of the class intervals for couples 
is double that of the class intervals for singles, the graphs are more similar than they are different. 


g. Answers may vary. Possible answers include: You are able to compare the graphs interval by interval. It 
is easier to compare the overall patterns with the new scale on the Couples graph. Because a couple 
represents two individuals, the new scale leads to a more accurate comparison. 

h. Answers may vary. Possible answers include: Based on the histograms, it seems that spending does not 
vary much from singles to individuals who are part of a couple. The overall patterns are the same. The 
range of spending for couples is approximately double the range for individuals. 


Exercise: 


Problem: 


Twenty-five randomly selected students were asked the number of movies they watched the previous week. 
The results are as follows. 


# of movies Frequency Relative frequency Cumulative relative frequency 
0 5 
1 9 
2 6 
3 4 
4 1 


a. Construct a histogram of the data. 
b. Complete the columns of the chart. 


Use the following information to answer the next two exercises: Suppose one hundred eleven people who shopped 
in a special t-shirt store were asked the number of t-shirts they own costing more than $19 each. 


[Figure: relative frequency histogram. Horizontal axis: Number of T-shirts costing more than $19 each, 1 to 7. Vertical axis: Relative frequency, marked 0, 10/111, 20/111, 30/111, 40/111.]


Exercise: 


Problem: 
The percentage of people who own at most three t-shirts costing more than $19 each is approximately: 


a. 21 
b. 59 
c. 41 
d. Cannot be determined 


Solution: 


c
Exercise: 
Problem: 
If the data were collected by asking the first 111 people who entered the store, then the type of sampling is: 
a. cluster
b. simple random
c. stratified
d. convenience


Exercise: 


Problem: Following are the 2010 obesity rates by U.S. states and Washington, DC. 


State            Percent (%)   State            Percent (%)   State            Percent (%)
Alabama          32.2          Kentucky         31.3          North Dakota     27.2
Alaska           24.5          Louisiana        31.0          Ohio             29.2
Arizona          24.3          Maine            26.8          Oklahoma         30.4
Arkansas         30.1          Maryland         27.1          Oregon           26.8
California       24.0          Massachusetts    23.0          Pennsylvania     28.6
Colorado         21.0          Michigan         30.9          Rhode Island     25.5
Connecticut      22.5          Minnesota        24.8          South Carolina   31.5
Delaware         28.0          Mississippi      34.0          South Dakota     27.3
Washington, DC   22.2          Missouri         30.5          Tennessee        30.8
Florida          26.6          Montana          23.0          Texas            31.0
Georgia          29.6          Nebraska         26.9          Utah             22.5
Hawaii           22.7          Nevada           22.4          Vermont          23.2
Idaho            26.5          New Hampshire    25.0          Virginia         26.0
Illinois         28.2          New Jersey       23.8          Washington       25.5
Indiana          29.6          New Mexico       25.1          West Virginia    32.5
Iowa             28.4          New York         23.9          Wisconsin        26.3
Kansas           29.4          North Carolina   27.8          Wyoming          25.1


Construct a bar graph of obesity rates of your state and the four states closest to your state. Hint: Label the x- 
axis with the states. 


Solution: 


Answers will vary. 


Glossary 


Frequency 
the number of times a value of the data occurs 


Histogram 
a graphical representation in x-y form of the distribution of data in a data set; x represents the data and y 
represents the frequency, or relative frequency. The graph consists of contiguous rectangles. 


Relative Frequency 
the ratio of the number of times a value of the data occurs in the set of all outcomes to the number of all 
outcomes 


Measures of the Location of the Data 


The common measures of location are quartiles and percentiles.
Quartiles are special percentiles. The first quartile, Q1, is the same as the 25th
percentile, and the third quartile, Q3, is the same as the 75th percentile. The
median, M, is called both the second quartile and the 50th percentile.


To calculate quartiles and percentiles, the data must be ordered from smallest 
to largest. Quartiles divide ordered data into quarters. Percentiles divide 
ordered data into hundredths. To score in the 90th percentile of an exam does
not mean, necessarily, that you received 90% on a test. It means that 90% of 
test scores are the same or less than your score and 10% of the test scores are 
the same or greater than your test score. 


Percentiles are useful for comparing values. For this reason, universities and 
colleges use percentiles extensively. One instance in which colleges and 
universities use percentiles is when SAT results are used to determine a 
minimum testing score that will be used as an acceptance factor. For 
example, suppose Duke accepts SAT scores at or above the 75th percentile.
That translates into a score of at least 1220. 


Percentiles are mostly used with very large populations. Therefore, if you 
were to say that 90% of the test scores are less (and not the same or less) than 
your score, it would be acceptable because removing one particular data 
value is not significant. 


The median is a number that measures the "center" of the data. You can think 
of the median as the "middle value," but it does not actually have to be one of 
the observed values. It is a number that separates ordered data into halves. 
Half the values are the same number or smaller than the median, and half the 
values are the same number or larger. For example, consider the following 
data. 

1; 11.5; 6; 7.2; 4; 8; 9; 10; 6.8; 8.3; 2; 2; 10; 1

Ordered from smallest to largest: 

1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5


Since there are 14 observations, the median is between the seventh value, 6.8,
and the eighth value, 7.2. To find the median, add the two values together and
divide by two.
Equation:

$$\frac{6.8 + 7.2}{2} = 7$$


The median is seven. Half of the values are smaller than seven and half of the 
values are larger than seven. 


Quartiles are numbers that separate the data into quarters. Quartiles may or 
may not be part of the data. To find the quartiles, first find the median or 
second quartile. The first quartile, Q1, is the middle value of the lower half of
the data, and the third quartile, Q3, is the middle value, or median, of the 
upper half of the data. To get the idea, consider the same data set: 

1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5


The median or second quartile is seven. The lower half of the data are 1, 1, 
2, 2, 4, 6, 6.8. The middle value of the lower half is two. 
1; 1; 2; 2; 4; 6; 6.8


The number two, which is part of the data, is the first quartile. One-fourth of
the entire set of values are the same as or less than two and three-fourths of
the values are more than two.


The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of 
the upper half is nine. 


The third quartile, Q3, is nine. Three-fourths (75%) of the ordered data set 
are less than nine. One-fourth (25%) of the ordered data set are greater than 
nine. The third quartile is part of the data set in this example. 
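
The median-of-halves procedure described above can be sketched in a few lines of Python. This is only an illustration: the function names `median` and `quartiles` are our own, and the sketch assumes the convention used in this section, where the middle value is left out of both halves when the number of observations is odd.

```python
def median(sorted_values):
    """Return the median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    # Even count: average the two middle values.
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    lower_half = data[: n // 2]          # values below the median position
    upper_half = data[(n + 1) // 2 :]    # values above the median position
    return median(lower_half), median(data), median(upper_half)


data = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]
q1, m, q3 = quartiles(data)
print("Q1 =", q1, "median =", m, "Q3 =", q3)
print("IQR =", q3 - q1)
```

For the 14 values in this section the sketch gives a first quartile of two, a median of seven, and a third quartile of nine, matching the hand calculation.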


The interquartile range is a number that indicates the spread of the middle 
half or the middle 50% of the data. It is the difference between the third 
quartile (Q3) and the first quartile (Q1).


IQR = Q3 - Q1


The IQR can help to determine potential outliers. A value is suspected to be
a potential outlier if it is more than (1.5)(IQR) below the first quartile or
more than (1.5)(IQR) above the third quartile. Potential outliers always
require further investigation.


Note: 


A potential outlier is a data point that is significantly different from the other 
data points. These special data points may be errors or some kind of 
abnormality or they may be a key to understanding the data. 


Example: 
Exercise: 


Problem: 
For the following 13 real estate prices, calculate the IQR and determine
if any prices are potential outliers. Prices are in dollars. 
389,950; 230,500; 158,000; 479,000; 639,000; 114,950; 5,500,000; 
387,000; 659,000; 529,000; 575,000; 488,800; 1,095,000 
Solution: 
Order the data from smallest to largest. 
114,950; 158,000; 230,500; 387,000; 389,950; 479,000; 488,800; 
529,000; 575,000; 639,000; 659,000; 1,095,000; 5,500,000 
M = 488,800

$Q_1 = \frac{230,500 + 387,000}{2} = 308,750$

$Q_3 = \frac{639,000 + 659,000}{2} = 649,000$

IQR = 649,000 - 308,750 = 340,250


(1.5)(IQR) = (1.5)(340,250) = 510,375
Q1 - (1.5)(IQR) = 308,750 - 510,375 = -201,625
Q3 + (1.5)(IQR) = 649,000 + 510,375 = 1,159,375


No house price is less than -201,625. However, 5,500,000 is more than
1,159,375. Therefore, 5,500,000 is a potential outlier. 
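
The same check can be automated. The sketch below is a minimal illustration, not a standard library routine: the helper names `median` and `iqr_fences` are hypothetical, and the quartiles are found with the median-of-halves method used in this section.

```python
def median(sorted_values):
    """Return the median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def iqr_fences(values):
    """Return (lower fence, upper fence) from the 1.5 x IQR rule."""
    data = sorted(values)
    n = len(data)
    q1 = median(data[: n // 2])
    q3 = median(data[(n + 1) // 2 :])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


prices = [389950, 230500, 158000, 479000, 639000, 114950, 5500000,
          387000, 659000, 529000, 575000, 488800, 1095000]
low, high = iqr_fences(prices)
# Any price outside the fences is flagged as a potential outlier.
potential_outliers = [p for p in prices if p < low or p > high]
print("fences:", low, high)
print("potential outliers:", potential_outliers)
```

A flagged value is only a candidate for further investigation, as noted above; the rule does not by itself say the value is an error.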


Example: 
Exercise: 


Problem: 

For the two data sets in the test scores example, find the following: 
a. The interquartile range. Compare the two interquartile ranges. 
b. Any outliers in either set. 

Solution: 


The five number summary for the day and night classes is 


         Minimum   Q1    Median   Q3     Maximum
Day      32        56    74.5     82.5   99
Night    25.5      78    81       89     98


a. The IQR for the day group is Q3 - Q1 = 82.5 - 56 = 26.5


The IQR for the night group is Q3 - Q1 = 89 - 78 = 11


The interquartile range (the spread or variability) for the day class
is larger than the night class IQR. This suggests more variation
will be found in the day class’s test scores.

b. Day class outliers are found using the IQR times 1.5 rule. So, 


• Q1 - IQR(1.5) = 56 - 26.5(1.5) = 16.25
• Q3 + IQR(1.5) = 82.5 + 26.5(1.5) = 122.25


Since the minimum and maximum values for the day class are 
greater than 16.25 and less than 122.25, there are no outliers. 


Night class outliers are calculated as: 


• Q1 - IQR(1.5) = 78 - 11(1.5) = 61.5
• Q3 + IQR(1.5) = 89 + 11(1.5) = 105.5


For this class, any test score less than 61.5 is an outlier. Therefore, 
the scores of 45 and 25.5 are outliers. Since no test score is greater 
than 105.5, there is no upper end outlier. 


Example: 
Fifty statistics students were asked how much sleep they get per school night 
(rounded to the nearest hour). The results were: 


Amount of sleep per                   Relative     Cumulative relative
school night (hours)     Frequency    frequency    frequency
4                        2            0.04         0.04
5                        5            0.10         0.14
6                        7            0.14         0.28
7                        12           0.24         0.52
8                        14           0.28         0.80
9                        7            0.14         0.94
10                       3            0.06         1.00


Find the 28th percentile. Notice the 0.28 in the "cumulative relative
frequency" column. Twenty-eight percent of 50 data values is 14 values.
There are 14 values less than the 28th percentile. They include the two 4s,
the five 5s, and the seven 6s. The 28th percentile is between the last six and
the first seven. The 28th percentile is 6.5.

Find the median. Look again at the "cumulative relative frequency" column 
and find 0.52. The median is the 50th percentile or the second quartile. 50%
of 50 is 25. There are 25 values less than the median. They include the two
4s, the five 5s, the seven 6s, and eleven of the 7s. The median or 50th
percentile is between the 25th, or seven, and 26th, or seven, values. The
median is seven. 

Find the third quartile. The third quartile is the same as the 75th percentile.
You can "eyeball" this answer. If you look at the "cumulative relative
frequency" column, you find 0.52 and 0.80. When you have all the fours,
fives, sixes and sevens, you have 52% of the data. When you include all the
8s, you have 80% of the data. The 75th percentile, then, must be an eight.
Another way to look at the problem is to find 75% of 50, which is 37.5, and
round up to 38. The third quartile, Q3, is the 38th value, which is an eight.


You can check this answer by counting the values. (There are 37 values 
below the third quartile and 12 values above.) 
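
The "eyeball" reading of the cumulative relative frequency column can be written as a short routine. This is a sketch under stated assumptions: the function name `percentile_from_table` is our own, every row of the table is assumed to have a positive frequency, and k is assumed to be strictly between 0 and 100. Counts are compared as integers so that no rounding error creeps in.

```python
def percentile_from_table(freq_table, k):
    """Find the kth percentile from (value, frequency) rows sorted by value.

    If the cumulative relative frequency hits k% exactly at a row, the
    percentile lies between that value and the next one, so average them.
    Otherwise it is the first value whose cumulative share exceeds k%.
    """
    if not 0 < k < 100:
        raise ValueError("k must be strictly between 0 and 100")
    n = sum(freq for _, freq in freq_table)
    cumulative = 0
    for position, (value, freq) in enumerate(freq_table):
        cumulative += freq
        # Compare cumulative/n with k/100 using integers only.
        if 100 * cumulative > k * n:
            return value
        if 100 * cumulative == k * n:
            next_value = freq_table[position + 1][0]
            return (value + next_value) / 2


# (hours of sleep, frequency) rows from the table above
sleep = [(4, 2), (5, 5), (6, 7), (7, 12), (8, 14), (9, 7), (10, 3)]
for k in (28, 50, 75):
    print(k, percentile_from_table(sleep, k))
```

For the sleep data this reproduces the 28th percentile of 6.5, the median of seven, and the third quartile of eight found above.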


Note: 
Try it 
Exercise: 


Problem: 


Forty bus drivers were asked how many hours they spend each day
running their routes (rounded to the nearest hour). Find the 65th
percentile.


Amount of time spent    Frequency    Relative     Cumulative relative
on route (hours)                     frequency    frequency
2                       12           0.30         0.30
3                       14           0.35         0.65
4                       10           0.25         0.90
5                       4            0.10         1.00


Solution:


The 65th percentile is between the last three and the first four.


The 65th percentile is 3.5.


Example: 
Exercise: 


Problem: Using [link]: 


a. Find the 80th percentile.
b. Find the 90th percentile.
c. Find the first quartile. What is another name for the first quartile?


Solution: 
Using the data from the frequency table, we have: 


a. The 80th percentile is between the last eight and the first nine in
the table (between the 40th and 41st values). Therefore, we need to
take the mean of the 40th and 41st values. The 80th percentile
$= \frac{8 + 9}{2} = 8.5$

b. The 90th percentile will be the 45th data value (location is 0.90(50)
= 45) and the 45th data value is nine.

c. Q1 is also the 25th percentile. The 25th percentile location
calculation: P25 = 0.25(50) = 12.5 ≈ 13, the 13th data value. Thus,
the 25th percentile is six.


A Formula for Finding the kth Percentile 


If you were to do a little research, you would find several formulas for
calculating the kth percentile. Here is one of them.


k = the kth percentile. It may or may not be part of the data.
i = the index (ranking or position of a data value)
n = the total number of data points, or observations


• Order the data from smallest to largest.
• Calculate $i = \frac{k}{100}(n + 1)$
• If i is an integer, then the kth percentile is the data value in the ith
position in the ordered set of data.
• If i is not an integer, then round i up and round i down to the nearest
integers. Average the two data values in these two positions in the
ordered data set. This is easier to understand in an example.


Example: 
Exercise: 


Problem: 


Listed are 29 ages for Academy Award winning best actors in order 
from smallest to largest. 

18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58;
62; 64; 67; 69; 71; 72; 73; 74; 76; 77


a. Find the 70th percentile.
b. Find the 83rd percentile.


Solution: 
a. • k = 70
   • i = the index
   • n = 29

$i = \frac{k}{100}(n + 1) = \frac{70}{100}(29 + 1) = 21$. Twenty-one is an integer, and
the data value in the 21st position in the ordered data set is 64. The
70th percentile is 64 years.


b. • k = 83
   • i = the index
   • n = 29

$i = \frac{k}{100}(n + 1) = \frac{83}{100}(29 + 1) = 24.9$, which is NOT an integer.
Round it down to 24 and up to 25. The age in the 24th position is
71 and the age in the 25th position is 72. Average 71 and 72. The
83rd percentile is 71.5 years.
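
The index formula translates directly into code. The following is a sketch only: `kth_percentile` is a hypothetical helper name, k is assumed to be a whole number, and the data are assumed to be sorted already. It uses integer arithmetic for i so that "is i an integer?" is decided exactly.

```python
def kth_percentile(sorted_data, k):
    """kth percentile using the index formula i = (k / 100)(n + 1)."""
    n = len(sorted_data)
    # whole is i rounded down; remainder is zero exactly when i is an integer.
    whole, remainder = divmod(k * (n + 1), 100)
    if whole < 1 or whole > n or (remainder and whole == n):
        raise ValueError("the index falls outside the data for this k and n")
    if remainder == 0:
        return sorted_data[whole - 1]  # positions in the text start at 1
    # i is not an integer: average the values on either side of it.
    return (sorted_data[whole - 1] + sorted_data[whole]) / 2


ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]
print(kth_percentile(ages, 70))
print(kth_percentile(ages, 83))
```

Other statistical packages often use different percentile formulas, so their answers for small data sets may differ slightly from this one.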


Note: 
Try It 
Exercise: 


Problem: 


Listed are 29 ages for Academy Award winning best actors in order 
from smallest to largest. 


18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58;
62; 64; 67; 69; 71; 72; 73; 74; 76; 77
Calculate the 20th percentile and the 55th percentile.


Solution: 


k = 20. Index = $i = \frac{k}{100}(n + 1) = \frac{20}{100}(29 + 1) = 6$. The age in the sixth
position is 27. The 20th percentile is 27 years.


k = 55. Index = $i = \frac{k}{100}(n + 1) = \frac{55}{100}(29 + 1) = 16.5$. Round down to
16 and up to 17. The age in the 16th position is 52 and the age in the
17th position is 55. The average of 52 and 55 is 53.5. The 55th
percentile is 53.5 years.


A Formula for Finding the Percentile of a Value in a Data Set 


• Order the data from smallest to largest.
• x = the number of data values counting from the bottom of the data list
up to but not including the data value for which you want to find the
percentile.
• y = the number of data values equal to the data value for which you want
to find the percentile.
• n = the total number of data.
• Calculate $\frac{x + 0.5y}{n}(100)$. Then round to the nearest integer.


Example: 
Exercise: 


Problem: 


Listed are 29 ages for Academy Award winning best actors in order 
from smallest to largest. 

18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58;
62; 64; 67; 69; 71; 72; 73; 74; 76; 77


a. Find the percentile for 58. 
b. Find the percentile for 25. 


Solution: 


a. Counting from the bottom of the list, there are 18 data values less
than 58. There is one value of 58.
x = 18 and y = 1. $\frac{x + 0.5y}{n}(100) = \frac{18 + 0.5(1)}{29}(100) \approx 63.79$. 58 is the
64th percentile.

b. Counting from the bottom of the list, there are three data values
less than 25. There is one value of 25.
x = 3 and y = 1. $\frac{x + 0.5y}{n}(100) = \frac{3 + 0.5(1)}{29}(100) \approx 12.07$. Twenty-five
is the 12th percentile.
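
As a quick check of the expression, the counting can be done in code. This is a minimal sketch: `percentile_of_value` is our own name for the helper, and it relies on Python's built-in `round`, which rounds exact halves to the nearest even integer (that case does not arise in this example).

```python
def percentile_of_value(data, value):
    """Percentile of a value using (x + 0.5y) / n * 100, rounded."""
    x = sum(1 for v in data if v < value)   # values below the one of interest
    y = sum(1 for v in data if v == value)  # values equal to it
    n = len(data)
    return round((x + 0.5 * y) / n * 100)


ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]
print(percentile_of_value(ages, 58))
print(percentile_of_value(ages, 25))
```

Because the function counts values rather than positions, the data do not need to be sorted before calling it.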


Interpreting Percentiles, Quartiles, and Median 


A percentile indicates the relative standing of a data value when data are
sorted into numerical order from smallest to largest. Approximately p percent
of data values are less than or equal to the pth percentile. For example, 15% of data
values are less than or equal to the 15th percentile.


• Low percentiles always correspond to lower data values.
• High percentiles always correspond to higher data values.


A percentile may or may not correspond to a value judgment about whether it 
is "good" or "bad." The interpretation of whether a certain percentile is 
"good" or "bad" depends on the context of the situation to which the data 
applies. In some situations, a low percentile would be considered "good;" in 
other contexts a high percentile might be considered "good". In many 
situations, there is no value judgment that applies. 


Understanding how to interpret percentiles properly is important not only 
when describing data, but also when calculating probabilities in later chapters 
of this text. 


Note: 


When writing the interpretation of a percentile in the context of the given 
data, the sentence should contain the following information. 


• information about the context of the situation being considered
• the data value (value of the variable) that represents the percentile
• the percent of individuals or items with data values below the percentile
• the percent of individuals or items with data values above the
percentile.


Example: 
Exercise: 


Problem: 


On a timed math test, the first quartile for time it took to finish the 
exam was 35 minutes. Interpret the first quartile in the context of this 
situation. 


Solution: 


• Twenty-five percent of students finished the exam in 35 minutes or
less.
• Seventy-five percent of students finished the exam in 35 minutes
or more.
• A low percentile could be considered good, as finishing more
quickly on a timed exam is desirable. (If you take too long, you
might not be able to finish.)


Example: 
Exercise: 


Problem: 


On a 20 question math test, the 70th percentile for number of correct
answers was 16. Interpret the 70th percentile in the context of this
situation. 


Solution: 


• Seventy percent of students answered 16 or fewer questions
correctly.
• Thirty percent of students answered 16 or more questions
correctly.
• A higher percentile could be considered good, as answering more
questions correctly is desirable.


Note: 
Try It 
Exercise: 


Problem: 


On a 60 point written assignment, the 80th percentile for the number of
points earned was 49. Interpret the 80th percentile in the context of this
situation. 


Solution: 


Eighty percent of students earned 49 points or fewer. Twenty percent of 
students earned 49 or more points. A higher percentile is good because 
getting more points on an assignment is desirable. 


Example: 
Exercise: 


Problem: 


At a community college, it was found that the 30th percentile of credit
units that students are enrolled for is seven units. Interpret the 30th
percentile in the context of this situation. 


Solution: 


• Thirty percent of students are enrolled in seven or fewer credit
units.
• Seventy percent of students are enrolled in seven or more credit
units.
• In this example, there is no "good" or "bad" value judgment
associated with a higher or lower percentile. Students attend
community college for varied reasons and needs, and their course
load varies according to their needs.


Example: 

Sharpe Middle School is applying for a grant that will be used to add fitness 
equipment to the gym. The principal surveyed 15 anonymous students to 
determine how many minutes a day the students spend exercising. The 
results from the 15 anonymous students are shown. 

0 minutes; 40 minutes; 60 minutes; 30 minutes; 60 minutes 

10 minutes; 45 minutes; 30 minutes; 300 minutes; 90 minutes; 

30 minutes; 120 minutes; 60 minutes; 0 minutes; 20 minutes 

Determine the following five values. 


• Min = 0
• Q1 = 20
• Med = 40
• Q3 = 60
• Max = 300


If you were the principal, would you be justified in purchasing new fitness 
equipment? Since 75% of the students exercise for 60 minutes or less daily, 
and since the IQR is 40 minutes (60 — 20 = 40), we know that half of the 
students surveyed exercise between 20 minutes and 60 minutes daily. This 
seems a reasonable amount of time spent exercising, so the principal would 
be justified in purchasing the new equipment. 

However, the principal needs to be careful. The value 300 appears to be a 
potential outlier. 

Q3 + 1.5(IQR) = 60 + (1.5)(40) = 120.

The value 300 is greater than 120 so it is a potential outlier. If we delete it 
and calculate the five values, we get the following values: 


• Min = 0
• Q1 = 20
• Q3 = 60
• Max = 120


We still have 75% of the students exercising for 60 minutes or less daily and 
half of the students exercising between 20 and 60 minutes a day. However, 
15 students is a small sample and the principal should survey more students 
to be sure of his survey results. 
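
To see how removing the value 300 changes the summary, the five values can be recomputed in code. This is an illustrative sketch only: `median` and `five_number_summary` are hypothetical helper names, and the quartiles use the median-of-halves method from earlier in this section.

```python
def median(sorted_values):
    """Return the median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    data = sorted(values)
    n = len(data)
    q1 = median(data[: n // 2])
    q3 = median(data[(n + 1) // 2 :])
    return data[0], q1, median(data), q3, data[-1]


minutes = [0, 40, 60, 30, 60, 10, 45, 30, 300, 90, 30, 120, 60, 0, 20]
print(five_number_summary(minutes))

# Drop values above the upper fence Q3 + 1.5(IQR) and summarize again.
_, q1, _, q3, _ = five_number_summary(minutes)
upper_fence = q3 + 1.5 * (q3 - q1)
trimmed = [m for m in minutes if m <= upper_fence]
print(five_number_summary(trimmed))
```

The quartiles stay at 20 and 60 after the potential outlier is dropped; only the maximum (and, slightly, the median) changes.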
Chapter Review


The values that divide a rank-ordered set of data into 100 equal parts are
called percentiles. Percentiles are used to compare and interpret data. For
example, an observation at the 50th percentile would be greater than 50
percent of the other observations in the set. Quartiles divide data into
quarters. The first quartile (Q1) is the 25th percentile, the second quartile (Q2
or median) is the 50th percentile, and the third quartile (Q3) is the 75th
percentile. The interquartile range, or IQR, is the range of the middle 50
percent of the data values. The IQR is found by subtracting Q1 from Q3, and
can help determine outliers by using the following two expressions.


• Q3 + IQR(1.5)
• Q1 - IQR(1.5)


Formula Review 


$i = \frac{k}{100}(n + 1)$


where i = the ranking or position of a data value,


k = the kth percentile,


n = total number of data.
Expression for finding the percentile of a data value: $\frac{x + 0.5y}{n}(100)$


where x = the number of values counting from the bottom of the data list up 
to but not including the data value for which you want to find the percentile, 


y = the number of data values equal to the data value for which you want to 
find the percentile, 


n = total number of data 
Exercise: 


Problem: 


Listed are 29 ages for Academy Award winning best actors in order 
from smallest to largest. 


18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58;
62; 64; 67; 69; 71; 72; 73; 74; 76; 77


a. Find the 40th percentile.
b. Find the 78th percentile.
Solution:


a. The 40th percentile is 37 years.
b. The 78th percentile is 70 years.


Exercise: 
Problem: 


Listed are 32 ages for Academy Award winning best actors in order 
from smallest to largest. 


18; 18; 21; 22; 25; 26; 27; 29; 30; 31; 31; 33; 36; 37; 37; 41; 42; 47; 52;
55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77


a. Find the percentile of 37. 
b. Find the percentile of 72. 


Exercise: 
Problem: 


Jesse was ranked 37th in his graduating class of 180 students. At what
percentile is Jesse’s ranking?


Solution:


Jesse graduated 37th out of a class of 180 students. There are 180 - 37 =
143 students ranked below Jesse. There is one rank of 37.


x = 143 and y = 1. $\frac{x + 0.5y}{n}(100) = \frac{143 + 0.5(1)}{180}(100) = 79.72$. Jesse’s rank
of 37 puts him at the 80th percentile.

Exercise: 
Problem: 


a. For runners in a race, a low time means a faster run. The winners in 
a race have the shortest running times. Is it more desirable to have a 
finish time with a high or a low percentile when running a race? 

b. The 20th percentile of run times in a particular race is 5.2 minutes.
Write a sentence interpreting the 20th percentile in the context of
the situation.

c. A bicyclist in the 90th percentile of a bicycle race completed the
race in 1 hour and 12 minutes. Is he among the fastest or slowest
cyclists in the race? Write a sentence interpreting the 90th percentile
in the context of the situation.


Exercise: 


Problem: 


a. For runners in a race, a higher speed means a faster run. Is it more 
desirable to have a speed with a high or a low percentile when 
running a race? 

b. The 40th percentile of speeds in a particular race is 7.5 miles per
hour. Write a sentence interpreting the 40th percentile in the context
of the situation. 


Solution: 


a. For runners in a race it is more desirable to have a high percentile 
for speed. A high percentile means a higher speed which is faster. 

b. 40% of runners ran at speeds of 7.5 miles per hour or less (slower). 
60% of runners ran at speeds of 7.5 miles per hour or more (faster). 


Exercise: 


Problem: 


On an exam, would it be more desirable to earn a grade with a high or 
low percentile? Explain. 


Exercise: 


Problem: 


Mina is waiting in line at the Department of Motor Vehicles (DMV). 
Her wait time of 32 minutes is the 85th percentile of wait times. Is that
good or bad? Write a sentence interpreting the 85th percentile in the
context of this situation.


Solution:


When waiting in line at the DMV, the 85th percentile would be a long
wait time compared to the other people waiting. 85% of people had 
shorter wait times than Mina. In this context, Mina would prefer a wait 
time corresponding to a lower percentile. 85% of people at the DMV 
waited 32 minutes or less. 15% of people at the DMV waited 32 minutes 
or longer. 


Exercise: 


Problem: 


In a survey collecting data about the salaries earned by recent college 
graduates, Li found that her salary was in the 78th percentile. Should Li
be pleased or upset by this result? Explain. 


Exercise: 


Problem: 


In a study collecting data about the repair costs of damage to 
automobiles in a certain type of crash tests, a certain model of car had 
$1,700 in damage and was in the 90th percentile. Should the
manufacturer and the consumer be pleased or upset by this result?
Explain and write a sentence that interprets the 90th percentile in the
context of this problem. 


Solution: 


The manufacturer and the consumer would be upset. This is a large 
repair cost for the damages, compared to the other cars in the sample. 
INTERPRETATION: 90% of the crash tested cars had damage repair 
costs of $1700 or less; only 10% had damage repair costs of $1700 or 
more. 


Exercise: 


Problem: 


The University of California has two criteria used to set admission 
standards for freshman to be admitted to a college in the UC system: 


a. Students' GPAs and scores on standardized tests (SATs and ACTs) 
are entered into a formula that calculates an "admissions index" 
score. The admissions index score is used to set eligibility 
standards intended to meet the goal of admitting the top 12% of 
high school students in the state. In this context, what percentile 
does the top 12% represent? 


b. Students whose GPAs are at or above the 96th percentile of all
students at their high school are eligible (called eligible in the local 
context), even if they are not in the top 12% of all students in the 
State. What percentage of students from each high school are 
"eligible in the local context"? 


Exercise: 


Problem: 


Suppose that you are buying a house. You and your realtor have 
determined that the most expensive house you can afford is the 34th
percentile. The 34th percentile of housing prices is $240,000 in the town
you want to move to. In this town, can you afford 34% of the houses or 
66% of the houses? 


Solution: 


You can afford 34% of houses. 66% of the houses are too expensive for 
your budget. INTERPRETATION: 34% of houses cost $240,000 or less. 
66% of houses cost $240,000 or more. 


Use the following information to answer the next six exercises. Sixty-five 
randomly selected car salespersons were asked the number of cars they 
generally sell in one week. Fourteen people answered that they generally sell 
three cars; nineteen generally sell four cars; twelve generally sell five cars; 
nine generally sell six cars; eleven generally sell seven cars. 

Exercise: 


Problem: First quartile = 


Exercise: 


Problem: Second quartile = median = 50th percentile =


Solution: 


4 


Exercise: 


Problem: Third quartile = 


Exercise: 


Problem: Interquartile range (IQR) = _____ - _____ = _____


Solution: 


6-4=2 


Exercise: 


Problem: 10th percentile =


Exercise:


Problem: 70th percentile =


Solution: 


6 


Homework 


Exercise: 


Problem: 


The median age for U.S. blacks currently is 30.9 years; for U.S. whites it 
is 42.3 years. 


a. Based upon this information, give two reasons why the black 
median age could be lower than the white median age. 

b. Does the lower median age for blacks necessarily mean that blacks 
die younger than whites? Why or why not? 

c. How might it be possible for blacks and whites to die at
approximately the same age, but for the median age for whites to be
higher?


Exercise: 
Problem: 
Six hundred adult Americans were asked by telephone poll, "What do
you think constitutes a middle-class income?" The results are in [link].
Also, include left endpoint, but not the right endpoint.


Salary ($) Relative frequency 
< 20,000 0.02 
20,000—25,000 0.09 
25,000—30,000 0.19 
30,000—40,000 0.26 
40,000—50,000 0.18 
50,000—75,000 0.17 
75,000—99,999 0.02 
100,000+ 0.01 


a. What percentage of the survey answered "not sure"? 

b. What percentage think that middle-class is from $25,000 to 
$50,000? 

c. Construct a histogram of the data. 


i. Should all bars have the same width, based on the data? Why 
or why not? 

ii. How should the <20,000 and the 100,000+ intervals be 
handled? Why? 


d. Find the 40th and 80th percentiles.
e. Construct a bar graph of the data.


Solution: 


a. 1 — (0.02+0.09+0.19+0.26+0.18+0.17+0.02+0.01) = 0.06 
b. 0.19+0.26+0.18 = 0.63 
c. Check student’s solution. 


d. The 40th percentile will fall between 30,000 and 40,000.
The 80th percentile will fall between 50,000 and 75,000.
e. Check student’s solution. 


Glossary 


Interquartile Range 
or IQR, is the range of the middle 50 percent of the data values; the IQR 
is found by subtracting the first quartile from the third quartile. 


Outlier 
an observation that does not fit the rest of the data 


Percentile 
a number that divides ordered data into hundredths; percentiles may or 
may not be part of the data. The median of the data is the second quartile 
and the 50th percentile. The first and third quartiles are the 25th and the
75th percentiles, respectively.


Quartiles 


the numbers that separate the data into quarters; quartiles may or may 
not be part of the data. The second quartile is the median of the data. 


Measures of the Center of the Data 


The "center" of a data set is also a way of describing location. The two most widely used measures of the 
"center" of the data are the mean (average) and the median. To calculate the mean weight of 50 people, 
add the 50 weights together and divide by 50. Technically this is the arithmetic mean. We will discuss the 
geometric mean later. To find the median weight of the 50 people, order the data and find the number
that splits the data into two equal parts, meaning an equal number of observations on each side. The
weights of 25 people are below this weight and the weights of the other 25 people are above it. The median is
generally a better measure of the center when there are extreme values or outliers because it is not 
affected by the precise numerical values of the outliers. The mean is the most common measure of the 
center. 


Note: 


The words “mean” and “average” are often used interchangeably. The substitution of one word for the 
other is common practice. The technical term is “arithmetic mean” and “average” is technically a center 
location. Formally, the arithmetic mean is called the first moment of the distribution by mathematicians. 
However, in practice among non-statisticians, “average” is commonly accepted for “arithmetic mean.” 


When each value in the data set is not unique, the mean can be calculated by multiplying each distinct 
value by its frequency and then dividing the sum by the total number of data values. The letter used to 
represent the sample mean is an x with a bar over it (pronounced “x bar”): $\bar{x}$.


The Greek letter μ (pronounced "mew") represents the population mean. One of the requirements for the
sample mean to be a good estimate of the population mean is for the sample taken to be truly random. 


To see that both ways of calculating the mean are the same, consider the sample: 
1; 1; 1; 2; 2; 3; 4; 4; 4; 4; 4 


Equation:
$$\bar{x} = \frac{1 + 1 + 1 + 2 + 2 + 3 + 4 + 4 + 4 + 4 + 4}{11} = 2.7$$
Equation:
$$\bar{x} = \frac{3(1) + 2(2) + 1(3) + 5(4)}{11} = 2.7$$
In the second calculation, the frequencies are 3, 2, 1, and 5. 
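
A short sketch can confirm that the two calculations agree. The variable names here are our own, and `collections.Counter` from the Python standard library is used only as a convenient way to tally the frequencies.

```python
from collections import Counter

sample = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4]

# First way: add every value and divide by the number of values.
mean_direct = sum(sample) / len(sample)

# Second way: multiply each distinct value by its frequency, then divide
# the total by the sum of the frequencies.
frequencies = Counter(sample)
mean_weighted = (
    sum(value * count for value, count in frequencies.items())
    / sum(frequencies.values())
)

print(round(mean_direct, 1), round(mean_weighted, 1))
```

Both results are 30/11, which rounds to 2.7.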
You can quickly find the location of the median by using the expression $\frac{n + 1}{2}$.


The letter n is the total number of data values in the sample. If n is an odd number, the median is the
middle value of the ordered data (ordered smallest to largest). If n is an even number, the median is equal
to the two middle values added together and divided by two after the data has been ordered. For example,
if the total number of data values is 97, then $\frac{n + 1}{2} = \frac{97 + 1}{2} = 49$. The median is the 49th value in the
ordered data. If the total number of data values is 100, then $\frac{n + 1}{2} = \frac{100 + 1}{2} = 50.5$. The median occurs
midway between the 50th and 51st values. The location of the median and the value of the median are not
the same. The upper case letter M is often used to represent the median. The next example illustrates the
location of the median and the value of the median.


Example: 
Exercise: 


Problem: 

AIDS data indicating the number of months a patient with AIDS lives after taking a new antibody 
drug are as follows (smallest to largest): 

3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17; 17; 18; 21; 22; 22; 24; 24; 25; 26; 26; 27; 27; 29;
29; 31; 32; 33; 33; 34; 34; 35; 37; 40; 44; 44; 47

Calculate the mean and the median. 


Solution: 


The calculation for the mean is: 


$$\bar{x} = \frac{3 + 4 + (8)(2) + 10 + 11 + 12 + 13 + 14 + (15)(2) + (16)(2) + \ldots + 35 + 37 + 40 + (44)(2) + 47}{40} = 23.6$$
To find the median, M, first use the formula for the location. The location is:


$$\frac{n + 1}{2} = \frac{40 + 1}{2} = 20.5$$


Starting at the smallest value, the median is located between the 20th and 21st values (the two 24s):
3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17; 17; 18; 21; 22; 22; 24; 24; 25; 26; 26; 27; 27; 29;
29; 31; 32; 33; 33; 34; 34; 35; 37; 40; 44; 44; 47


$$M = \frac{24 + 24}{2} = 24$$
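
The distinction between the location of the median and its value can be illustrated with a small sketch. The variable names are our own, and the list below is the ordered data from this example as transcribed here.

```python
import math

months = [3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18,
          21, 22, 22, 24, 24, 25, 26, 26, 27, 27, 29, 29, 31, 32, 33,
          33, 34, 34, 35, 37, 40, 44, 44, 47]  # already sorted

n = len(months)
mean = sum(months) / n

# Location of the median: (n + 1) / 2. This is a position, not a value.
location = (n + 1) / 2

# Value of the median: average the values on either side of the location.
# Positions in the text start at 1, so subtract 1 for Python indexing.
below = months[math.floor(location) - 1]
above = months[math.ceil(location) - 1]
median_value = (below + above) / 2

print("mean:", round(mean, 1))
print("location:", location, "median:", median_value)
```

When the location is a whole number, the two positions coincide and the average is simply the middle value itself.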


Example: 
Exercise: 


Problem: 


Suppose that in a small town of 50 people, one person earns $5,000,000 per year and the other 49 
each earn $30,000. Which is the better measure of the "center": the mean or the median? 


Solution: 


$$\bar{x} = \frac{5,000,000 + 49(30,000)}{50} = 129,400$$
M = 30,000
(There are 49 people who earn $30,000 and one person who earns $5,000,000.) 


The median is a better measure of the "center" than the mean because 49 of the values are 30,000 
and one is 5,000,000. The 5,000,000 is an outlier. The 30,000 gives us a better sense of the middle 
of the data. 
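
The effect of the single large income is easy to reproduce with Python's standard `statistics` module. This is only a sketch of the comparison; the variable names are our own.

```python
import statistics

# One person earns $5,000,000; the other 49 each earn $30,000.
incomes = [5_000_000] + [30_000] * 49

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print("mean:", mean_income)
print("median:", median_income)

# Count how many people earn less than the mean.
below_mean = sum(1 for income in incomes if income < mean_income)
print("people earning less than the mean:", below_mean)
```

Here 49 of the 50 people earn less than the mean, which is why the median describes the typical income better in this town.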


Another measure of the center is the mode. The mode is the most frequent value. There can be more than 
one mode in a data set as long as those values have the same frequency and that frequency is the highest. 
A data set with two modes is called bimodal. 


Example: 

Statistics exam scores for 20 students are as follows: 
50; 53; 59; 59; 63; 63; 72; 72; 72; 72; 72; 76; 78; 81; 83; 84; 84; 84; 90; 93
Exercise: 


Problem: Find the mode. 
Solution: 


The most frequent score is 72, which occurs five times. Mode = 72. 


Example: 

Five real estate exam scores are 430, 430, 480, 480, 495. The data set is bimodal because the scores 430 
and 480 each occur twice. 
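
Finding every mode, including the bimodal case, can be sketched with `statistics.multimode` from the Python standard library (available in Python 3.8 and later). The variable names are our own.

```python
import statistics

exam_scores = [50, 53, 59, 59, 63, 63, 72, 72, 72, 72, 72, 76, 78, 81,
               83, 84, 84, 84, 90, 93]
real_estate_scores = [430, 430, 480, 480, 495]

# multimode returns a list of all values tied for the highest frequency,
# in the order they first appear in the data.
print(statistics.multimode(exam_scores))
print(statistics.multimode(real_estate_scores))

# The mode also works for qualitative data.
colors = ["red", "red", "red", "green", "green", "yellow",
          "purple", "black", "blue"]
print(statistics.multimode(colors))
```

A result with two entries signals a bimodal data set, as with the real estate exam scores.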

When is the mode the best measure of the "center"? Consider a weight loss program that advertises a 
mean weight loss of six pounds the first week of the program. The mode might indicate that most people 
lose two pounds the first week, making the program less appealing. 


Note: 


The mode can be calculated for qualitative data as well as for quantitative data. For example, if the data 
set is: red, red, red, green, green, yellow, purple, black, blue, the mode is red. 


Calculating the Arithmetic Mean of Grouped Frequency Tables 


When only grouped data is available, you do not know the individual data values (we only know intervals 
and interval frequencies); therefore, you cannot compute an exact mean for the data set. What we must do 
is estimate the actual mean by calculating the mean of a frequency table. A frequency table is a data 
representation in which grouped data is displayed along with the corresponding frequencies. To calculate 
the mean from a grouped frequency table, we can apply the basic definition of the mean:

$\text{mean} = \frac{\text{data sum}}{\text{number of data values}}$

We simply need to modify the definition to fit within the restrictions of a frequency table.

Since we do not know the individual data values, we can instead find the midpoint of each interval. The midpoint is

$\text{midpoint} = \frac{\text{lower boundary} + \text{upper boundary}}{2}$

We can now modify the mean definition to be

$\text{Mean of Frequency Table} = \frac{\sum fm}{\sum f}$

where $f$ = the frequency of the interval and $m$ = the midpoint of the interval.


Example: 
Exercise: 


Problem: 


A frequency table displaying Professor Blount’s last statistics test is shown. Find the best estimate of
the class mean. 


Grade interval Number of students 
50-56.5 1 
56.5-62.5 0 
62.5-68.5 4 
68.5-74.5 4 
74.5-80.5 2 
80.5-86.5 3 
86.5-92.5 4 
92.5-98.5 1 
Solution: 


• Find the midpoints for all intervals.

Grade interval    Midpoint
50-56.5           53.25
56.5-62.5         59.5
62.5-68.5         65.5
68.5-74.5         71.5
74.5-80.5         77.5
80.5-86.5         83.5
86.5-92.5         89.5
92.5-98.5         95.5

• Calculate the sum of the products of each interval frequency and midpoint, $\sum fm$:

53.25(1) + 59.5(0) + 65.5(4) + 71.5(4) + 77.5(2) + 83.5(3) + 89.5(4) + 95.5(1) = 1460.25

• Divide by the total number of students, $\sum f = 19$:

$\mu = \frac{\sum fm}{\sum f} = \frac{1460.25}{19} = 76.86$
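
The same steps can be sketched in Python. The function name `grouped_mean` is a hypothetical helper, not part of any library; it takes the interval boundaries and frequencies from Professor Blount's table and applies the sum of frequency times midpoint, divided by the sum of the frequencies.

```python
def grouped_mean(intervals, frequencies):
    """Estimate the mean from (lower, upper) interval boundaries and frequencies."""
    # Midpoint of each interval: (lower boundary + upper boundary) / 2
    midpoints = [(lower + upper) / 2 for lower, upper in intervals]
    # Sum of frequency times midpoint, divided by the sum of the frequencies
    sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))
    return sum_fm / sum(frequencies)

grade_intervals = [(50, 56.5), (56.5, 62.5), (62.5, 68.5), (68.5, 74.5),
                   (74.5, 80.5), (80.5, 86.5), (86.5, 92.5), (92.5, 98.5)]
students = [1, 0, 4, 4, 2, 3, 4, 1]

estimate = grouped_mean(grade_intervals, students)
print(round(estimate, 2))
```

Keep in mind that the result is an estimate: the function treats every value in an interval as if it sat at the midpoint.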


Note: 
Try It 
Exercise: 


Problem: 


Maris conducted a study on the effect that playing video games has on memory recall. As part of 
her study, she compiled the following data: 


Hours teenagers spend on video games    Number of teenagers
0-3.5                                   3
3.5-7.5                                 7
7.5-11.5                                12
11.5-15.5                               7
15.5-19.5                               9


What is the best estimate for the mean number of hours spent playing video games? 


Solution: 


Find the midpoint of each interval, multiply by the corresponding number of teenagers, add the
results, and then divide by the total number of teenagers.

The midpoints are 1.75, 5.5, 9.5, 13.5, and 17.5.

Mean = [(1.75)(3) + (5.5)(7) + (9.5)(12) + (13.5)(7) + (17.5)(9)] / 38 = 409.75/38 ≈ 10.78


References
“Demographics: Obesity – adult prevalence rate.” Indexmundi. Available online at
http://www.indexmundi.com/g/r.aspx?t=50&v=2228&l=en (accessed April 3, 2013).


Chapter Review 


The mean and the median can be calculated to help you find the "center" of a data set. The mean is the 
best estimate for the actual data set, but the median is the best measurement when a data set contains 
several outliers or extreme values. The mode will tell you the most frequently occurring datum (or data) in
your data set. The mean, median, and mode are extremely helpful when you need to analyze your data, 
but if your data set consists of ranges which lack specific values, the mean may seem impossible to 
calculate. However, the mean can be approximated if you add the lower boundary with the upper 
boundary and divide by two to find the midpoint of each interval. Multiply each midpoint by the number 
of values found in the corresponding range. Divide the sum of these values by the total number of data 
values in the set. 


Formula Review 


$\mu = \frac{\sum fm}{\sum f}$ where $f$ = interval frequencies and $m$ = interval midpoints.

The arithmetic mean for a sample (denoted by $\bar{x}$) is $\bar{x} = \frac{\text{Sum of all values in the sample}}{\text{Number of values in the sample}}$

The arithmetic mean for a population (denoted by $\mu$) is $\mu = \frac{\text{Sum of all values in the population}}{\text{Number of values in the population}}$


Exercise: 


Problem: Find the mean for the following frequency tables. 


a. Grade Frequency
49.5-59.5 2
59.5-69.5 3
69.5-79.5 8
79.5-89.5 12
89.5-99.5 5


b. Daily low temperature Frequency 
49.5-59.5 53 
59.5-69.5 32 
69.5-79.5 15 
79.5-89.5 1 


89.5-99.5 0 


c. Points per game Frequency 
49.5-59.5 14 
59.5-69.5 32 
69.5-79.5 15 
79.5-89.5 23 


89.5-99.5 2 


Use the following information to answer the next three exercises: The following data show the lengths of 
boats moored in a marina. The data are ordered from smallest to largest: 


16; 17; 19; 20; 20; 21; 23; 24; 25; 25; 25; 26; 26; 27; 27; 27; 28; 29; 30; 32; 33; 33; 34; 35; 37; 39; 40
Exercise: 


Problem: Calculate the mean. 


Solution: 


Mean: 16 + 17 + 19 + 20 + 20 + 21 + 23 + 24 + 25 + 25 + 25 + 26 + 26 + 27 + 27 + 27 + 28 + 29 +
30 + 32 + 33 + 33 + 34 + 35 + 37 + 39 + 40 = 738;

$\frac{738}{27} = 27.33$


Exercise: 


Problem: Identify the median. 


Exercise: 


Problem: Identify the mode. 


Solution: 


The most frequent lengths are 25 and 27, which occur three times. Mode = 25, 27 


Use the following information to answer the next three exercises: Sixty-five randomly selected car 
salespersons were asked the number of cars they generally sell in one week. Fourteen people answered 
that they generally sell three cars; nineteen generally sell four cars; twelve generally sell five cars; nine 
generally sell six cars; eleven generally sell seven cars. Calculate the following: 

Exercise: 


Problem: sample mean = $\bar{x}$ =


Exercise: 


Problem: median = 


Solution: 


4 


Exercise: 


Problem: mode = 


Homework 


Exercise: 


Problem: 


The most obese countries in the world have obesity rates that range from 11.4% to 74.6%. This data 


is summarized in the following table. 


Percent of population obese 
11.4—20.45 

20.45-29.45 

29.45-38.45 

38.45-47.45 

47.45-56.45 

56.45-65.45 

65.45-74.45 


74.45-83.45 


Number of countries 


29 


13 


a. What is the best estimate of the average obesity percentage for these countries? 
b. The United States has an average obesity rate of 33.9%. Is this rate above average or below? 
c. How does the United States compare to other countries? 


Exercise: 


Problem: 


[link] gives the percent of children under five considered to be underweight. What is the best 


estimate for the mean percentage of underweight children? 


Percent of underweight children 


16—21.45 


21.45-26.9 


26.9-32.35 


Number of countries 


23 


Percent of underweight children Number of countries 


32.35-37.8 7 

37.8-43.25 6 

43.25-48.7 1 
Solution: 


The mean percentage, $\bar{x} = \frac{1328.65}{50} \approx 26.57$


Bringing It Together 


Exercise: 
Problem: 
Javier and Ercilia are supervisors at a shopping mall. Each was given the task of estimating the mean 


distance that shoppers live from the mall. They each randomly surveyed 100 shoppers. The samples 
yielded the following information. 


            Javier       Ercilia
$\bar{x}$   6.0 miles    6.0 miles
$s$         4.0 miles    7.0 miles


a. How can you determine which survey was correct?

b. Explain what the difference in the results of the surveys implies about the data. 

c. If the two histograms depict the distribution of values for each supervisor, which one depicts 
Ercilia's sample? How do you know? 


(b) 


Use the following information to answer the next three exercises: We are interested in the number of 
years students in a particular elementary statistics class have lived in California. The information in the 
following table is from the entire section. 


Number of years 
a 

14 

15 

18 

19 


20 


Exercise: 


Problem: What is the IQR? 


a. 8 

b. 11 
c. 15
d. 35 


Solution: 


a 


Exercise: 


Problem: What is the mode? 


a. 19 

b. 19.5 

c. 14 and 20 
d. 22.65 


Exercise: 


Problem: Is this a sample or the entire population? 


a. sample 


b. entire population 


c. neither 


Solution: 


Frequency 
1 


3 


Number of years 
22 
23 
26 
40 


42 


Frequency 
1 


1 


Total = 20 


Glossary 


Frequency Table 
a data representation in which grouped data is displayed along with the corresponding frequencies 


Mean (arithmetic) 
a number that measures the central tendency of the data; a common name for mean is 'average.' The
term 'mean' is a shortened form of 'arithmetic mean.' By definition, the mean for a sample (denoted
by $\bar{x}$) is $\bar{x} = \frac{\text{Sum of all values in the sample}}{\text{Number of values in the sample}}$, and the mean for a population (denoted by $\mu$) is
$\mu = \frac{\text{Sum of all values in the population}}{\text{Number of values in the population}}$.


Mean (geometric) 
a measure of central tendency that provides a measure of average geometric growth over multiple 


time periods. 


Median 
a number that separates ordered data into halves; half the values are the same number or smaller 
than the median and half the values are the same number or larger than the median. The median may 
or may not be part of the data. 


Midpoint 
the mean of an interval in a frequency table 


Mode 
the value that appears most frequently in a set of data 


Sigma Notation and Calculating the Arithmetic Mean 


Formula for Population Mean
Equation:

$\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$

Formula for Sample Mean
Equation:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} X_i$


This unit is here to remind you of material that you once studied and said at 
the time “I am sure that I will never need this!” 


Here are the formulas for a population mean and the sample mean. The
Greek letter $\mu$ is the symbol for the population mean and $\bar{x}$ is the symbol for
the sample mean. Both formulas have a mathematical symbol that tells us
how to make the calculations. It is called Sigma notation because the
symbol is the Greek capital letter sigma: $\Sigma$. Like all mathematical symbols
it tells us what to do: just as the plus sign tells us to add and the × tells us to
multiply. These are called mathematical operators. The $\Sigma$ symbol tells us to
add a specific list of numbers.


Let’s say we have a sample of animals from the local animal shelter and we 
are interested in their average age. If we list each value, or observation, in a 
column, you can give each one an index number. The first number will be 
number 1 and the second number 2 and so on. 


Animal    Age
1         9
2         1
3         8.5
4         10.5
5         10
6         8.5
7         12
8         8
9         1
10        9.5


Each observation represents a particular animal in the sample. Purr is 
animal number one and is a 9 year old cat, Toto is animal number 2 and is a 
1 year old puppy and so on. 


To calculate the mean we are told by the formula to add up all these 
numbers, ages in this case, and then divide the sum by 10, the total number 
of animals in the sample. 


Animal number one, the cat Purr, is designated as $X_1$, animal number 2,
Toto, is designated as $X_2$ and so on through Dundee who is animal number
10 and is designated as $X_{10}$.

The i in the formula tells us which of the observations to add together. In
this case it is $X_1$ through $X_{10}$, which is all of them. We know which ones to
add by the indexing notation, the i = 1 and the n or capital N for the
population. For this example the indexing notation would be i = 1 and
because it is a sample we use a small n on the top of the $\Sigma$, which would be
10.


The standard deviation requires the same mathematical operator and so it 
would be helpful to recall this knowledge from your past. 


The sum of the ages is found to be 78 and dividing by 10 gives us the 
sample mean age as 7.8 years. 
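
To make the indexing concrete, here is a small Python sketch of what the sigma symbol asks us to do. The loop variable `i` plays the role of the index under the sigma; the list name `ages` is illustrative, and the values are the ten ages from the table.

```python
ages = [9, 1, 8.5, 10.5, 10, 8.5, 12, 8, 1, 9.5]  # X_1 through X_10

n = len(ages)
total = 0.0
for i in range(1, n + 1):   # i = 1, 2, ..., n, the limits written below and above the sigma
    total += ages[i - 1]    # Python lists start at position 0, so X_i is ages[i - 1]

sample_mean = total / n     # divide the sum by n, the number of animals
print(total, sample_mean)
```

In everyday code the loop is usually written as `sum(ages) / len(ages)`; the explicit loop is shown only to mirror the notation.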
Exercise: 


Problem: 


A group of 10 children are on a scavenger hunt to find different color 
rocks. The results are shown in the [link] below. The column on the 
right shows the number of colors of rocks each child has. What is the 
mean number of rocks? 


Child Rock colors 
1 5 
2 5 
3 6 
4 2 
5 4
6 3 


9 


10 


Exercise: 


Problem: 


10 


A group of children are measured to determine the average height of 
the group. The results are in [link] below. What is the mean height of 
the group to the nearest hundredth of an inch? 


Child      Height in inches
Adam       45.21
Betty      39.45
Charlie    43.78
Donna      48.76
Earl       37.39
Fran       39.90
George     45.56
Heather    46.24


Solution: 


43.29 in.
Exercise: 


Problem: 


A person compares prices for five automobiles. The results are in 
[link]. What is the mean price of the cars the person has considered? 


Price 

$20,987 
$22,008 
$19,998 
$23,433 


$21,444 


Solution: 


$21,574 


Exercise: 


Problem: 


A customer protection service has obtained 8 bags of candy that are 
supposed to contain 16 ounces of candy each. The candy is weighed to 
determine if the average weight is at least the claimed 16 ounces. The 
results are given in [link]. What is the mean weight of a bag of
candy in the sample? 


Weight in ounces 
15.65 
16.09 
16.01 
15.99 
16.02 
16.00 
15.98 


16.08 


Solution: 


15.98 ounces 


Exercise: 


Problem: 


A teacher records grades for a class of 70, 72, 79, 81, 82, 82, 83, 90, 
and 95. What is the mean of these grades? 


Solution: 


81.56 

Exercise: 
Problem: 
A family is polled to see the mean of the number of hours per day the 
television set is on. The results, starting with Sunday, are 6, 3, 2, 3, 1, 
3, and 7 hours. What is the average number of hours the family had the 
television set on to the nearest whole number? 


Solution: 


4 hours 
Exercise: 
Problem: 
A city received the following rainfall for a recent year. What is the 


mean number of inches of rainfall the city received monthly, to the 
nearest hundredth of an inch? Use [link]. 


Month Rainfall in inches 
January 2.21 


February 3.12 


March 4.11 


April 2.09 
May 0.99 
June 1.08 
July 2.99 
August 0.08 
September 0.52 
October 1.89 
November 2.00 
December 3.06 

Solution: 

2.01 inches 

Exercise: 
Problem: 


A football team scored the following points in its first 8 games of the 
new season. Starting at game 1 and in order the scores are 14, 14, 24, 
21, 7, 0, 38, and 28. What is the mean number of points the team 
scored in these eight games? 


Solution: 


18.25 


Homework 


Exercise: 


Problem: 


A sample of 10 prices is chosen from a population of 100 similar 
items. The values obtained from the sample, and the values for the 
population, are given in [link] and [link] respectively. 


a. Is the mean of the sample within $1 of the population mean? 
b. What is the difference in the sample and population means? 


Prices of the sample 
$21 
$23 
$21 
$24 
$22 
$22 
$25 
$21 


$20 


$24 


Prices of the population Frequency 
$20 20 
$21 35 
$22 15 
$23 10 
$24 18 
$25 2 
Solution: 
a. Yes 


b. The sample is 0.5 higher. 
Exercise: 
Problem: 
A standardized test is given to ten people at the beginning of the 
school year with the results given in [link] below. At the end of the 


year the same people were again tested. 


a. What is the average improvement? 


b. Does it matter if the means are subtracted, or if the individual 
values are subtracted? 


Student Beginning score Ending score 
1 1100 1120 
2 980 1030 
3 1200 1208 
4 998 1000 
5 893 948 
6 1015 1030 
7 1217 1224 
8 1232 1245 
9 967 988 
10 988 997 
Solution: 
a. 20 
b. No 
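
Part b can be checked numerically. The sketch below, with illustrative list names and the scores from the table, computes the average improvement both ways: as the mean of the individual differences and as the difference of the two means.

```python
from statistics import mean

beginning = [1100, 980, 1200, 998, 893, 1015, 1217, 1232, 967, 988]
ending = [1120, 1030, 1208, 1000, 948, 1030, 1224, 1245, 988, 997]

# Subtract student by student, then average the ten differences.
mean_of_differences = mean([e - b for e, b in zip(ending, beginning)])

# Average each column first, then subtract the two means.
difference_of_means = mean(ending) - mean(beginning)

print(mean_of_differences, difference_of_means)
```

The two results agree because both are the same total improvement divided by the same number of students.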


Exercise: 


Problem: 


A small class of 7 students has a mean grade of 82 on a test. If six of 
the grades are 80, 82,86, 90, 90, and 95, what is the other grade? 


Solution: 


51 
Exercise: 


Problem: 


A class of 20 students has a mean grade of 80 on a test. Nineteen of the
students have a mean grade between 79 and 82, inclusive.


a. What is the lowest possible grade of the other student? 
b. What is the highest possible grade of the other student? 
Solution: 


a. 42 
b. 99 


Exercise: 
Problem: 


If the mean of 20 prices is $10.39, and 5 of the items with a mean of 
$10.99 are sampled, what is the mean of the other 15 prices? 


Solution: 


$10.19 
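
The last few exercises all rest on one idea: the sum of the values equals the mean times the number of values. A minimal Python sketch of that reasoning follows; `mean_of_rest` is a hypothetical helper name, not a library function.

```python
def mean_of_rest(total_count, total_mean, known_count, known_mean):
    """Mean of the remaining values, given the overall mean and the mean of a known subset."""
    # Sum of all values minus the sum of the known subset
    remaining_sum = total_count * total_mean - known_count * known_mean
    return remaining_sum / (total_count - known_count)

# Twenty prices with mean $10.39; five of them have mean $10.99.
other_prices_mean = mean_of_rest(20, 10.39, 5, 10.99)

# Seven grades with mean 82; six of the grades are known.
missing_grade = 7 * 82 - sum([80, 82, 86, 90, 90, 95])

print(round(other_prices_mean, 2), missing_grade)
```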


Skewness and the Mean, Median, and Mode 


Consider the following data set. 
4; 5; 6; 6; 6; 7; 7; 7; 7; 7; 7; 8; 8; 8; 9; 10


This data set can be represented by the following histogram. Each interval has
width one, and each value is located in the middle of an interval.

[Histogram; horizontal axis: 4 to 10]


The histogram displays a symmetrical distribution of data. A distribution is 
symmetrical if a vertical line can be drawn at some point in the histogram 
such that the shape to the left and the right of the vertical line are mirror 
images of each other. The mean, the median, and the mode are each seven 
for these data. In a perfectly symmetrical distribution, the mean and the 
median are the same. This example has one mode (unimodal), and the 
mode is the same as the mean and median. In a symmetrical distribution 
that has two modes (bimodal), the two modes would be different from the 
mean and median. 


The histogram for the data: 4; 5; 6; 6; 6; 7; 7; 7; 7; 8 is not symmetrical. The right-hand
side seems "chopped off" compared to the left side. A distribution of this
type is called skewed to the left because it is pulled out to the left. We can
formally measure the skewness of a distribution just as we can
mathematically measure the center weight of the data or its general
"spreadness". The mathematical formula for skewness is:

$a_3 = \frac{\sum (x_i - \bar{x})^3}{n s^3}$

The greater the deviation from zero, the greater the degree of skewness.
If the skewness is negative then the distribution is skewed left as in [link].
A positive measure of skewness indicates right skewness such as [link].

[Histogram; horizontal axis: 4 to 8]


The mean is 6.3, the median is 6.5, and the mode is seven. Notice that the 
mean is less than the median, and they are both less than the mode. The 
mean and the median both reflect the skewing, but the mean reflects it more
so.


The histogram for the data: 6; 7; 7; 7; 7; 8; 8; 8; 9; 10 is also not symmetrical. It is
skewed to the right.

[Histogram; horizontal axis: 6 to 10]


The mean is 7.7, the median is 7.5, and the mode is seven. Of the three 
statistics, the mean is the largest, while the mode is the smallest. Again,
the mean reflects the skewing the most. 


To summarize, generally if the distribution of data is skewed to the left, the 
mean is less than the median, which is often less than the mode. If the 


distribution of data is skewed to the right, the mode is often less than the 
median, which is less than the mean. 
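
The ordering described in this summary can be verified for the two skewed data sets used above. This is a sketch using Python's standard `statistics` module; the dictionary keys are illustrative labels.

```python
from statistics import mean, median, mode

data_sets = {
    "skewed left": [4, 5, 6, 6, 6, 7, 7, 7, 7, 8],
    "skewed right": [6, 7, 7, 7, 7, 8, 8, 8, 9, 10],
}

# For each data set, collect (mean, median, mode) so the ordering can be compared.
summary = {name: (mean(values), median(values), mode(values))
           for name, values in data_sets.items()}

for name, (m, med, mo) in summary.items():
    print(f"{name}: mean={m}, median={med}, mode={mo}")
```

In the left-skewed set the mean is the smallest of the three measures, and in the right-skewed set it is the largest, matching the "generally" and "often" wording of the summary.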


As with the mean, median and mode, and as we will see shortly, the 
variance, there are mathematical formulas that give us precise measures of 
these characteristics of the distribution of the data. Again looking at the 
formula for skewness we see that this is a relationship between the mean of 
the data and the individual observations cubed. 

Equation:

$a_3 = \frac{\sum (x_i - \bar{x})^3}{n s^3}$

where $s$ is the sample standard deviation of the data, $x_i$, and $\bar{x}$ is the
arithmetic mean and $n$ is the sample size.


Formally the arithmetic mean is known as the first moment of the 
distribution. The second moment we will see is the variance, and skewness 
is the third moment. The variance measures the squared differences of the 
data from the mean and skewness measures the cubed differences of the 
data from the mean. While a variance can never be a negative number, the 
measure of skewness can and this is how we determine if the data are 
skewed right or left. The skewness for a normal distribution is zero, and any
symmetric data should have skewness near zero. Negative values for the 
skewness indicate data that are skewed left and positive values for the 
skewness indicate data that are skewed right. By skewed left, we mean that 
the left tail is long relative to the right tail. Similarly, skewed right means 
that the right tail is long relative to the left tail. The skewness characterizes 
the degree of asymmetry of a distribution around its mean. While the mean
and standard deviation are dimensional quantities (this is why we will take
the square root of the variance), that is, they have the same units as the measured
quantities $x_i$, the skewness is conventionally defined in such a way as to
make it nondimensional. It is a pure number that characterizes only the
shape of the distribution. A positive value of skewness signifies a 
distribution with an asymmetric tail extending out towards more positive X 
and a negative value signifies a distribution whose tail extends out towards 


more negative X. A zero measure of skewness will indicate a symmetrical 
distribution. 
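
As a rough check on the sign rule, the skewness measure described in this section (the sum of cubed deviations from the mean, divided by n times the cube of the sample standard deviation) can be sketched in Python. The function name `skewness` is illustrative, and the `symmetric` list is a made-up data set added only for comparison.

```python
from statistics import mean, stdev

def skewness(data):
    """Sum of cubed deviations from the mean, divided by n * s**3 (s = sample standard deviation)."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation (divides by n - 1)
    return sum((x - x_bar) ** 3 for x in data) / (n * s ** 3)

skewed_left = [4, 5, 6, 6, 6, 7, 7, 7, 7, 8]
skewed_right = [6, 7, 7, 7, 7, 8, 8, 8, 9, 10]
symmetric = [5, 6, 7, 7, 8, 9]  # hypothetical symmetric data set

for name, data in [("left", skewed_left), ("right", skewed_right), ("symmetric", symmetric)]:
    print(name, skewness(data))
```

The left-skewed data give a negative value, the right-skewed data give a positive value, and the symmetric data give zero, because the cubed deviations on either side of the mean cancel.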


Skewness and symmetry become important when we discuss probability 
distributions in later chapters. 


Chapter Review 


Looking at the distribution of data can reveal a lot about the relationship 
between the mean, the median, and the mode. There are three types of 
distributions. A left (or negative) skewed distribution has a shape like 
[link]. A right (or positive) skewed distribution has a shape like [link]. A 
symmetrical distribution looks like [link].


Formula Review 


Formula for skewness: $a_3 = \frac{\sum (x_i - \bar{x})^3}{n s^3}$

Formula for Coefficient of Variation:

$CV = \frac{s}{\bar{x}} \cdot 100$ conditioned upon $\bar{x} \neq 0$


Use the following information to answer the next three exercises: State 
whether the data are symmetrical, skewed to the left, or skewed to the right. 
Exercise: 


Problem: 1; 1; 1; 2; 2; 2; 2; 3; 3; 3; 3; 3; 3; 3; 3; 4; 4; 4; 5; 5


Solution: 


The data are symmetrical. The median is 3 and the mean is 2.85. They 
are close, and the mode lies close to the middle of the data, so the data 
are symmetrical. 


Exercise: 


Problem: 16; 17; 19; 22; 22; 22; 22; 22; 23


Exercise: 


Problem: 87; 87; 87; 87; 87; 88; 89; 89; 90; 91
Solution: 


The data are skewed right. The median is 87.5 and the mean is 88.2. 
Even though they are close, the mode lies to the left of the middle of 
the data, and there are many more instances of 87 than any other 
number, so the data are skewed right. 


Exercise: 
Problem: 
When the data are skewed left, what is the typical relationship between 
the mean and median? 
Exercise: 
Problem: 


When the data are symmetrical, what is the typical relationship 
between the mean and median? 


Solution: 


When the data are symmetrical, the mean and median are close or the 
same. 


Exercise: 


Problem: What word describes a distribution that has two modes? 


Exercise: 


Problem: Describe the shape of this distribution. 


Solution: 


The distribution is skewed right because it looks pulled out to the right. 
Exercise: 

Problem: 

Describe the relationship between the mode and the median of this 


distribution. 


Exercise: 


Problem: 


Describe the relationship between the mean and the median of this 
distribution. 


Solution: 


The mean is 4.1 and is slightly greater than the median, which is four. 


Exercise: 


Problem: Describe the shape of this distribution. 


Exercise: 


Problem: 


Describe the relationship between the mode and the median of this 
distribution. 


Solution: 


The mode and the median are the same. In this case, they are both five. 
Exercise: 
Problem: 


Are the mean and the median the exact same in this distribution? Why 
or why not? 


Exercise: 


Problem: Describe the shape of this distribution. 




Solution: 


The distribution is skewed left because it looks pulled out to the left. 
Exercise: 
Problem: 


Describe the relationship between the mode and the median of this 


distribution. 


Exercise: 


Problem: 


Describe the relationship between the mean and the median of this 
distribution. 




Solution: 
The mean and the median are both six. 
Exercise: 
Problem: The mean and median for the data are the same. 
3; 4; 5; 5; 6; 6; 6; 6; 7; 7; 7; 7; 7; 7; 7


Is the data perfectly symmetrical? Why or why not? 
Exercise: 


Problem: 


Which is the greatest, the mean, the mode, or the median of the data 
set? 


11; 11; 12; 12; 12; 12; 13; 15; 17; 22; 22; 22
Solution: 


The mode is 12, the median is 12.5, and the mean is 15.1. The mean is 
the largest. 


Exercise: 


Problem: 


Which is the least, the mean, the mode, or the median of the data set?


56; 56; 56; 58; 59; 60; 62; 64; 64; 65; 67
Exercise: 
Problem: 


Of the three measures, which tends to reflect skewing the most, the 
mean, the mode, or the median? Why? 


Solution: 
The mean tends to reflect skewing the most because it is affected the 
most by outliers. 
Exercise: 
Problem: 


In a perfectly symmetrical distribution, when would the mode be 
different from the mean and median? 


Homework 


Exercise: 


Problem: 


The median age of the U.S. population in 1980 was 30.0 years. In 
1991, the median age was 33.1 years. 


a. What does it mean for the median age to rise? 

b. Give two reasons why the median age could rise. 

c. For the median age to rise, is the actual number of children less in 
1991 than it was in 1980? Why or why not? 


Measures of the Spread of the Data 


An important characteristic of any set of data is the variation in the data. In some data sets, the data values are 
concentrated closely near the mean; in other data sets, the data values are more widely spread out from the 
mean. The most common measure of variation, or spread, is the standard deviation. The standard deviation is a 
number that measures how far data values are from their mean. 


The standard deviation 


• provides a numerical measure of the overall amount of variation in a data set, and
• can be used to determine whether a particular data value is close to or far from the mean.


The standard deviation provides a measure of the overall variation in a data set 


The standard deviation is always positive or zero. The standard deviation is small when the data are all 
concentrated close to the mean, exhibiting little variation or spread. The standard deviation is larger when the 
data values are more spread out from the mean, exhibiting more variation. 


Suppose that we are studying the amount of time customers wait in line at the checkout at supermarket A and 
supermarket B. The average wait time at both supermarkets is five minutes. At supermarket A, the standard 
deviation for the wait time is two minutes; at supermarket B, the standard deviation for the wait time is four
minutes. 


Because supermarket B has a higher standard deviation, we know that there is more variation in the wait times 
at supermarket B. Overall, wait times at supermarket B are more spread out from the average; wait times at 
supermarket A are more concentrated near the average. 


Calculating the Standard Deviation 


If x is a number, then the difference "x minus the mean" is called its deviation. In a data set, there are as many
deviations as there are items in the data set. The deviations are used to calculate the standard deviation. If the
numbers belong to a population, in symbols a deviation is $x - \mu$. For sample data, in symbols a deviation is $x - \bar{x}$.


The procedure to calculate the standard deviation depends on whether the numbers are the entire population or 
are data from a sample. The calculations are similar, but not identical. Therefore the symbol used to represent 
the standard deviation depends on whether it is calculated from a population or a sample. The lower case letter $s$
represents the sample standard deviation and the Greek letter $\sigma$ (sigma, lower case) represents the population
standard deviation. If the sample has the same characteristics as the population, then $s$ should be a good estimate
of $\sigma$.


To calculate the standard deviation, we need to calculate the variance first. The variance is the average of the
squares of the deviations (the $x - \bar{x}$ values for a sample, or the $x - \mu$ values for a population). The symbol $\sigma^2$
represents the population variance; the population standard deviation $\sigma$ is the square root of the population
variance. The symbol $s^2$ represents the sample variance; the sample standard deviation $s$ is the square root of the
sample variance. You can think of the standard deviation as a special average of the deviations. Formally, the
variance is the second moment of the distribution around the mean. Remember that the
mean is the first moment of the distribution.


If the numbers come from a census of the entire population and not a sample, when we calculate the average of
the squared deviations to find the variance, we divide by N, the number of items in the population. If the data
are from a sample rather than a population, when we calculate the average of the squared deviations, we divide
by n - 1, one less than the number of items in the sample.


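Formulas for the Sample Standard Deviation

• $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$ or $s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n-1}}$ or $s = \sqrt{\frac{\left(\sum_{i=1}^{n} x_i^2\right) - n\bar{x}^2}{n-1}}$

• For the sample standard deviation, the denominator is n - 1, that is, the sample size minus 1.

Formulas for the Population Standard Deviation

• $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$ or $\sigma = \sqrt{\frac{\sum f(x - \mu)^2}{N}}$ or $\sigma = \sqrt{\frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2}$

• For the population standard deviation, the denominator is N, the number of items in the population.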


In these formulas, f represents the frequency with which a value appears. For example, if a value appears once, f
is one. If a value appears three times in the data set or population, f is three. Two important observations
concerning the variance and standard deviation: the deviations are measured from the mean and the deviations
are squared. In principle, the deviations could be measured from any point; however, our interest is
measurement from the center weight of the data, what is the "normal" or most usual value of the observation.
Later we will be trying to measure the "unusualness" of an observation or a sample mean and thus we need a 
measure from the mean. The second observation is that the deviations are squared. This does two things, first it 
makes the deviations all positive and second it changes the units of measurement from that of the mean and the 
original observations. If the data are weights then the mean is measured in pounds, but the variance is measured 
in pounds-squared. One reason to use the standard deviation is to return to the original units of measurement by 
taking the square root of the variance. Further, when the deviations are squared it explodes their value. For 
example, a deviation of 10 from the mean when squared is 100, but a deviation of 100 from the mean is 10,000. 
What this does is place great weight on outliers when calculating the variance. 
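
The difference between dividing by n - 1 and dividing by N can be seen in a short Python sketch. The function names and the wait-time values are hypothetical, chosen only to keep the arithmetic easy to follow by hand.

```python
from math import sqrt

def sample_sd(data):
    """Sample standard deviation: squared deviations from the mean, divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))

def population_sd(data):
    """Population standard deviation: squared deviations from the mean, divided by N."""
    N = len(data)
    mu = sum(data) / N
    return sqrt(sum((x - mu) ** 2 for x in data) / N)

wait_times = [3, 4, 5, 6, 7]  # hypothetical wait times in minutes

print(sample_sd(wait_times))      # divides the sum of squares (10) by 4
print(population_sd(wait_times))  # divides the sum of squares (10) by 5
```

For the same numbers the sample version is always at least as large as the population version, because it divides the same sum of squares by a smaller number. Python's standard library offers `statistics.stdev` and `statistics.pstdev` for the same two calculations.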


Types of Variability in Samples 


When trying to study a population, a sample is often used, either for convenience or because it is not possible to 
access the entire population. Variability is the term used to describe the differences that may occur in these 
outcomes. Common types of variability include the following: 


• Observational or measurement variability
• Natural variability
• Induced variability
• Sample variability


Here are some examples to describe each type of variability. 


Example 1: Measurement variability 

Measurement variability occurs when there are differences in the instruments used to measure or in the people 
using those instruments. If we are gathering data on how long it takes for a ball to drop from a height by having 
students measure the time of the drop with a stopwatch, we may experience measurement variability if the two 
stopwatches used were made by different manufacturers. For example, one stopwatch measures to the nearest
second, whereas the other one measures to the nearest tenth of a second. We also may experience measurement 
variability because two different people are gathering the data. Their reaction times in pressing the button on the 


stopwatch may differ; thus, the outcomes will vary accordingly. The differences in outcomes may be affected by 
measurement variability. 


Example 2: Natural variability 

Natural variability arises from the differences that naturally occur because members of a population differ from 
each other. For example, if we have two identical corn plants and we expose both plants to the same amount of 
water and sunlight, they may still grow at different rates simply because they are two different corn plants. The 
difference in outcomes may be explained by natural variability. 


Example 3: Induced variability 

Induced variability is the counterpart to natural variability; this occurs because we have artificially induced an 
element of variation (that, by definition, was not present naturally). For example, we assign people to two
different groups to study memory, and we induce a variable in one group by limiting the amount of sleep they 
get. The difference in outcomes may be affected by induced variability. 


Example 4: Sample variability 
Sample variability occurs when multiple random samples are taken from the same population. For example, if I 
conduct four surveys of 50 people randomly selected from a given population, the differences in outcomes may 
be affected by sample variability. 


Example: 

In a fifth grade class, the teacher was interested in the average age and the sample standard deviation of the 
ages of her students. The following data are the ages for a SAMPLE of n = 20 fifth grade students. The ages are 
rounded to the nearest half year: 

9; 9.5; 9.5; 10; 10; 10; 10; 10.5; 10.5; 10.5; 10.5; 11; 11; 11; 11; 11; 11; 11.5; 11.5; 11.5 

Equation: 

$\bar{x} = \dfrac{9 + 9.5(2) + 10(4) + 10.5(4) + 11(6) + 11.5(3)}{20} = 10.525$

The average age is 10.53 years, rounded to two places. 

The variance may be calculated by using a table. Then the standard deviation is calculated by taking the square 
root of the variance. We will explain the parts of the table after calculating s. 


Data    Freq.    Deviations                   Deviations²                 (Freq.)(Deviations²) 
x       f        (x − x̄)                      (x − x̄)²                    (f)(x − x̄)² 
9       1        9 − 10.525 = −1.525          (−1.525)² = 2.325625        1 × 2.325625 = 2.325625 
9.5     2        9.5 − 10.525 = −1.025        (−1.025)² = 1.050625        2 × 1.050625 = 2.101250 
10      4        10 − 10.525 = −0.525         (−0.525)² = 0.275625        4 × 0.275625 = 1.1025 
10.5    4        10.5 − 10.525 = −0.025       (−0.025)² = 0.000625        4 × 0.000625 = 0.0025 
11      6        11 − 10.525 = 0.475          (0.475)² = 0.225625         6 × 0.225625 = 1.35375 
11.5    3        11.5 − 10.525 = 0.975        (0.975)² = 0.950625         3 × 0.950625 = 2.851875 

The total is 9.7375 


The sample variance, s², is equal to the sum of the last column (9.7375) divided by the total number of data 
values minus one (20 − 1): 

$s^2 = \dfrac{9.7375}{20 - 1} = 0.5125$

The sample standard deviation s is equal to the square root of the sample variance: 

$s = \sqrt{0.5125} = 0.715891$, which is rounded to two decimal places, s = 0.72. 
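
The table calculation can also be checked with a short program. The following is a minimal sketch in Python, not part of the original text; the variable names (`age_freq`, `ss`, and so on) are chosen here for illustration. It follows the same steps as the table: find the mean, weight each squared deviation by its frequency, divide by n − 1, and take the square root.

```python
from math import sqrt

# Ages and their frequencies, copied from the table above.
age_freq = {9: 1, 9.5: 2, 10: 4, 10.5: 4, 11: 6, 11.5: 3}

n = sum(age_freq.values())                              # sample size
mean = sum(x * f for x, f in age_freq.items()) / n      # sample mean
# Sum of the frequency-weighted squared deviations (last column of the table).
ss = sum(f * (x - mean) ** 2 for x, f in age_freq.items())
variance = ss / (n - 1)                                 # sample variance, divides by n - 1
std_dev = sqrt(variance)                                # sample standard deviation

print(f"n = {n}, mean = {mean:.3f}")
print(f"sum of squared deviations = {ss:.4f}")
print(f"s^2 = {variance:.4f}, s = {std_dev:.2f}")
```

Run as written, the program should reproduce the values worked out by hand: a mean of 10.525, a total of 9.7375, a variance of 0.5125, and s rounded to 0.72.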


Explanation of the standard deviation calculation shown in the table 


The deviations show how spread out the data are about the mean. The data value 11.5 is farther from the mean 
than is the data value 11, which is indicated by the deviations 0.975 and 0.475. A positive deviation occurs when 
the data value is greater than the mean, whereas a negative deviation occurs when the data value is less than the 
mean. The deviation is −1.525 for the data value nine. If you add the deviations, the sum is always zero. (For 
[link], there are n = 20 deviations.) So you cannot simply add the deviations to get the spread of the data. By 
squaring the deviations, you make them positive numbers, and the sum will also be positive. The variance, then, 
is the average squared deviation. By squaring the deviations we are placing an extreme penalty on observations 
that are far from the mean; these observations get greater weight in the calculations of the variance. We will see 
later on that the variance (standard deviation) plays the critical role in determining our conclusions in inferential 
statistics. We can begin now by using the standard deviation as a measure of "unusualness." "How did you do on 
the test?" "Terrific! Two standard deviations above the mean." This, we will see, is an unusually good exam 
grade. 


The variance is a squared measure and does not have the same units as the data. Taking the square root solves 
the problem. The standard deviation measures the spread in the same units as the data. 


Notice that instead of dividing by n = 20, the calculation divided by n — 1 = 20 — 1 = 19 because the data is a 
sample. For the sample variance, we divide by the sample size minus one (n — 1). Why not divide by n? The 
answer has to do with the population variance. The sample variance is an estimate of the population 
variance. This estimate requires us to use an estimate of the population mean rather than the actual population 
mean. Based on the theoretical mathematics that lies behind these calculations, dividing by (n — 1) gives a better 
estimate of the population variance. 


The standard deviation, s or σ, is either zero or larger than zero. Describing the data with reference to the spread 
is called "variability". The variability in data depends upon the method by which the outcomes are obtained; for 
example, by measuring or by random sampling. When the standard deviation is zero, there is no spread; that is, 
all the data values are equal to each other. The standard deviation is small when the data are all concentrated 
close to the mean, and is larger when the data values show more variation from the mean. When the standard 
deviation is a lot larger than zero, the data values are very spread out about the mean; outliers can make s or σ 
very large. 


Example: 
Exercise: 


Problem: Use the following data (first exam scores) from Susan Dean's spring pre-calculus class: 


33; 42; 49; 49; 53; 55; 55; 61; 63; 67; 68; 68; 69; 69; 72; 73; 74; 78; 80; 83; 88; 88; 88; 90; 92; 94; 94; 94; 
94; 96; 100 


a. Create a chart containing the data, frequencies, relative frequencies, and cumulative relative 
frequencies to three decimal places. 
b. Calculate the following to one decimal place: 


i. The sample mean 

ii. The sample standard deviation 
iii. The median 
iv. The first quartile 

v. The third quartile 
vi. IQR 


Solution: 
a. See [link] 


b. i. The sample mean = 73.5 
ii. The sample standard deviation = 17.9 
iii. The median = 73 
iv. The first quartile = 61 
v. The third quartile = 90 
vi. IQR = 90 — 61 = 29 


Data Frequency Relative frequency Cumulative relative frequency 
33 1 0.032 0.032 
42 1 0.032 0.064 
49 2 0.065 0.129 
53 1 0.032 0.161 
55 2 0.065 0.226 
61 1 0.032 0.258 
63 1 0.032 0.29 
67 1 0.032 0.322 
68 2 0.065 0.387 
69 2 0.065 0.452 
72 1 0.032 0.484 
73 1 0.032 0.516 
74 1 0.032 0.548 
78 1 0.032 0.580 
80 1 0.032 0.612 
83 1 0.032 0.644 
88 3 0.097 0.741 
90 1 0.032 0.773 
92 1 0.032 0.805 
94 4 0.129 0.934 
96 1 0.032 0.966 
100 1 0.032 0.998 (Why isn't this value 1? ANSWER: Rounding) 


Standard deviation of Grouped Frequency Tables 


Recall that for grouped data we do not know individual data values, so we cannot describe the typical value of 
the data with precision. In other words, we cannot find the exact mean, median, or mode. We can, however, 
determine the best estimate of the measures of center by finding the mean of the grouped data with the formula: 

$\text{Mean of Frequency Table} = \dfrac{\sum fm}{\sum f}$

where f = interval frequencies and m = interval midpoints. 

Just as we could not find the exact mean, neither can we find the exact standard deviation. Remember that 
standard deviation describes numerically the expected deviation a data value has from the mean. In simple 
English, the standard deviation allows us to compare how “unusual” individual data is compared to the mean. 


Example: 
Find the standard deviation for the data in [link]. 


Class    Frequency, f    Midpoint, m    f·m             f(m − x̄)² 
0-2      1               1              1·1 = 1         1(1 − 7.58)² = 43.26 
3-5      6               4              6·4 = 24        6(4 − 7.58)² = 76.77 
6-8      10              7              10·7 = 70       10(7 − 7.58)² = 3.33 
9-11     7               10             7·10 = 70       7(10 − 7.58)² = 41.10 
12-14    0               13             0·13 = 0        0(13 − 7.58)² = 0 
15-17    2               16             2·16 = 32       2(16 − 7.58)² = 141.89 
         n = 26                         x̄ = 197/26 = 7.58      s² = 306.35/(26 − 1) = 12.25 


For this data set, we have the mean, x̄ = 7.58, and the standard deviation, s_x = 3.5. This means that a randomly 
selected data value would be expected to be 3.5 units from the mean. If we look at the first class, we see that 
the class midpoint is equal to one. This is almost two full standard deviations from the mean since 7.58 − 3.5 − 
3.5 = 0.58. While the formula for calculating the standard deviation is not complicated, 

$s_x = \sqrt{\dfrac{\sum f(m - \bar{x})^2}{n - 1}}$

where s_x = sample standard deviation and x̄ = sample mean, the calculations are tedious. It is usually best to use 
technology when performing the calculations. 


Comparing Values from Different Data Sets 


The standard deviation is useful when comparing data values that come from different data sets. If the data sets 
have different means and standard deviations, then comparing the data values directly can be misleading. 


• For each data value x, calculate how many standard deviations away from its mean the value is. 
• Use the formula: x = mean + (#ofSTDEVs)(standard deviation); solve for #ofSTDEVs. 
• #ofSTDEVs = (x − mean) / (standard deviation) 
• Compare the results of this calculation. 


#ofSTDEVs is often called a "z-score"; we can use the symbol z. In symbols, the formulas become: 


Sample        $x = \bar{x} + zs$          $z = \dfrac{x - \bar{x}}{s}$ 
Population    $x = \mu + z\sigma$         $z = \dfrac{x - \mu}{\sigma}$ 

Example: 
Exercise: 
Problem: 


Two students, John and Ali, from different high schools, wanted to find out who had the highest GPA 
when compared to his school. Which student had the highest GPA when compared to his school? 


Student GPA School mean GPA School standard deviation 


John 2.85 3.0 0.7 
Ali 77 80 10 

Solution: 


For each student, determine how many standard deviations (#ofSTDEVs) his GPA is away from the 
average, for his school. Pay careful attention to signs when comparing and interpreting the answer. 


$z = \#\text{ofSTDEVs} = \dfrac{\text{value} - \text{mean}}{\text{standard deviation}} = \dfrac{x - \mu}{\sigma}$

For John, $z = \#\text{ofSTDEVs} = \dfrac{2.85 - 3.0}{0.7} = -0.21$ 

For Ali, $z = \#\text{ofSTDEVs} = \dfrac{77 - 80}{10} = -0.3$ 


John has the better GPA when compared to his school because his GPA is 0.21 standard deviations below 
his school's mean while Ali's GPA is 0.3 standard deviations below his school's mean. 


John's z-score of —0.21 is higher than Ali's z-score of —0.3. For GPA, higher values are better, so we 
conclude that John has the better GPA when compared to his school. 
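
The comparison can be sketched in a few lines of Python. This is an illustrative sketch rather than part of the original solution: the function name `z_score` and the dictionary layout are assumptions made here, and Ali's GPA is taken as 77, the value consistent with the stated z-score of −0.3.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from mean."""
    return (value - mean) / std_dev

# (GPA, school mean, school standard deviation) for each student.
# Ali's GPA is taken as 77, the value consistent with the stated z-score of -0.3.
students = {"John": (2.85, 3.0, 0.7), "Ali": (77, 80, 10)}

z = {name: z_score(*row) for name, row in students.items()}
best = max(z, key=z.get)  # for GPA, the higher z-score is better

for name, score in z.items():
    print(f"{name}: z = {score:.2f}")
print("Better GPA relative to school:", best)
```

Because both z-scores are negative, taking the maximum picks the student who is the smaller number of standard deviations below his school's mean.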


Note: 
Try It 
Exercise: 


Problem: 
Two swimmers, Angie and Beth, from different teams, wanted to find out who had the fastest time for the 
50 meter freestyle when compared to her team. Which swimmer had the fastest time when compared to 
her team? 


Swimmer Time (seconds) Team mean time Team standard deviation 
Angie 26.2 27.2 0.8 
Beth 27.3 30.1 1.4 

Solution: 


For Angie: $z = \dfrac{26.2 - 27.2}{0.8} = -1.25$ 

For Beth: $z = \dfrac{27.3 - 30.1}{1.4} = -2$ 


The following lists give a few facts that provide a little more insight into what the standard deviation tells us 
about the distribution of the data. 
For ANY data set, no matter what the distribution of the data is: 

• At least 75% of the data is within two standard deviations of the mean. 
• At least 89% of the data is within three standard deviations of the mean. 
• At least 95% of the data is within 4.5 standard deviations of the mean. 
• This is known as Chebyshev's Rule. 

For data having a Normal Distribution, which we will examine in great detail later: 

• Approximately 68% of the data is within one standard deviation of the mean. 
• Approximately 95% of the data is within two standard deviations of the mean. 
• More than 99% of the data is within three standard deviations of the mean. 
• This is known as the Empirical Rule. 
• It is important to note that this rule only applies when the shape of the distribution of the data is 
bell-shaped and symmetric. We will learn more about this when studying the "Normal" or "Gaussian" 
probability distribution in later chapters. 
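
The three percentages listed for Chebyshev's Rule follow from a single bound that the text does not state explicitly: for any k greater than one, at least 1 − 1/k² of the data lies within k standard deviations of the mean. The sketch below, with a function name chosen here for illustration, evaluates that bound for the three values of k used above.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of any data set within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3, 4.5):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1%} of the data")
```

The bound is a guaranteed minimum, so for bell-shaped data the Empirical Rule gives noticeably larger percentages at the same k.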


Coefficient of Variation 


Another useful way to compare distributions besides simple comparisons of means or standard deviations is to 
adjust for differences in the scale of the data being measured. Quite simply, a large variation in data with a large 
mean is different than the same variation in data with a small mean. To adjust for the scale of the underlying 
data the Coefficient of Variation (CV) has been developed. Mathematically: 

Equation: 

$CV = \dfrac{s}{\bar{x}} \cdot 100$, conditioned upon $\bar{x} \neq 0$, where s is the standard deviation of the data and x̄ is the mean. 


We can see that this measures the variability of the underlying data as a percentage of the mean value, the center 
of weight of the data set. This measure is useful in comparing risk where an adjustment is warranted because of 
differences in scale of two data sets. In effect, the scale is changed to a common scale, percentage differences, which 
allows direct comparison of the magnitudes of variation of two or more data sets. 
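
A small sketch may make the scale adjustment concrete. The two data sets below are hypothetical and were chosen only so that they share the same standard deviation while having very different means; the function name is likewise an assumption of this sketch.

```python
from statistics import mean, stdev

def coefficient_of_variation(data):
    """Sample standard deviation expressed as a percentage of the mean."""
    x_bar = mean(data)
    if x_bar == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(data) / x_bar * 100

# Hypothetical data sets with the same spread but different scales.
small_scale = [10, 20, 30]
large_scale = [1010, 1020, 1030]

print(f"CV (small scale): {coefficient_of_variation(small_scale):.1f}%")
print(f"CV (large scale): {coefficient_of_variation(large_scale):.1f}%")
```

Both sets have a standard deviation of 10, yet the CV of the first is far larger because that spread is large relative to a mean of 20 and small relative to a mean of 1,020.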


References 
Data from Microsoft Bookshelf. 


King, Bill. “Graphically Speaking.” Institutional Research, Lake Tahoe Community College. Available online at 
http://www.ltcc.edu/web/about/institutional-research (accessed April 3, 2013). 


Chapter Review 


The standard deviation can help you measure the spread of data. There are different equations to use depending on 
whether you are calculating the standard deviation of a sample or of a population. 


• The Standard Deviation allows us to compare individual data or classes to the data set mean numerically. 
• $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$ or $s = \sqrt{\dfrac{\sum f(x - \bar{x})^2}{n - 1}}$ is the formula for calculating the standard deviation of a sample. To 
calculate the standard deviation of a population, we would use the population mean, μ, and the formula 
$\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$ or $\sigma = \sqrt{\dfrac{\sum f(x - \mu)^2}{N}}$. 


Formula Review 


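$s_x = \sqrt{\dfrac{\sum fm^2}{n} - \bar{x}^2}$ where s_x = sample standard deviation and x̄ = sample mean 


Formulas for Sample Standard Deviation: $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$ or $s = \sqrt{\dfrac{\sum f(x - \bar{x})^2}{n - 1}}$ or $s = \sqrt{\dfrac{\left(\sum x^2\right) - n\bar{x}^2}{n - 1}}$. For the 
sample standard deviation, the denominator is n − 1, that is, the sample size minus one. 


Formulas for Population Standard Deviation: $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$ or $\sigma = \sqrt{\dfrac{\sum f(x - \mu)^2}{N}}$ or $\sigma = \sqrt{\dfrac{\sum x^2}{N} - \mu^2}$. For 
the population standard deviation, the denominator is N, the number of items in the population. 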


Use the following information to answer the next two exercises: The following data are the distances between 20 
retail stores and a large distribution center. The distances are in miles. 

29; 37; 38; 40; 58; 67; 68; 69; 76; 86; 87; 95; 96; 96; 99; 106; 112; 127; 145; 150 

Exercise: 


Problem: 


Use a graphing calculator or computer to find the standard deviation and round to the nearest tenth. 
Solution: 


s=34.5 
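
For readers using a computer rather than a graphing calculator, the following sketch shows one way to obtain the same result. It assumes Python's standard `statistics` module, whose `stdev` function uses the sample formula with n − 1 in the denominator.

```python
from statistics import mean, stdev

# Distances (in miles) between the 20 retail stores and the distribution center.
distances = [29, 37, 38, 40, 58, 67, 68, 69, 76, 86,
             87, 95, 96, 96, 99, 106, 112, 127, 145, 150]

x_bar = mean(distances)
s = stdev(distances)  # sample standard deviation (divides by n - 1)

print(f"mean = {x_bar:.2f} miles, s = {s:.1f} miles")
```

Rounded to the nearest tenth, the computed standard deviation should agree with the solution above.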


Exercise: 


Problem: Find the value that is one standard deviation below the mean. 
Exercise: 
Problem: 
Two baseball players, Fredo and Karl, on different teams wanted to find out who had the higher batting 
average when compared to his team. Which baseball player had the higher batting average when compared 
to his team? 


Baseball player Batting average Team batting average Team standard deviation 
Fredo 0.158 0.166 0.012 
Karl 0.177 0.189 0.015 
Solution: 
For Fredo: $z = \dfrac{0.158 - 0.166}{0.012} = -0.67$ 

For Karl: $z = \dfrac{0.177 - 0.189}{0.015} = -0.8$ 


Fredo’s z-score of —0.67 is higher than Karl’s z-score of —0.8. For batting average, higher values are better, 
so Fredo has a better batting average compared to his team. 


Exercise: 


Problem: Use [link] to find the value that is three standard deviations: 


a. above the mean 
b. below the mean 


Exercise: 


Problem: 


Find the standard deviation for the following frequency tables using the formula. Check the calculations 
with the TI 83/84. 


a. Grade Frequency 
49.5-59.5 2 
59.5-69.5 3 
69.5-79.5 8 
79.5-89.5 12 
89.5-99.5 5 

b. Daily low temperature Frequency 
49.5-59.5 53 
59.5-69.5 32 
69.5-79.5 15 
79.5-89.5 1 
89.5-99.5 0 

c. Points per game Frequency 
49.5-59.5 14 
59.5-69.5 32 
69.5-79.5 15 
79.5-89.5 23 
89.5-99.5 2 
Solution: 


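a. $s_x = \sqrt{\dfrac{\sum fm^2}{n} - \bar{x}^2} = \sqrt{\dfrac{193157.5}{30} - 79.5^2} = 10.88$ 

b. $s_x = \sqrt{\dfrac{\sum fm^2}{n} - \bar{x}^2} = \sqrt{\dfrac{380945.3}{101} - 60.94^2} = 7.62$ 

c. $s_x = \sqrt{\dfrac{\sum fm^2}{n} - \bar{x}^2} = \sqrt{\dfrac{440051.5}{86} - 70.66^2} = 11.14$ 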
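
The grouped calculation for table (a) can be sketched as follows. This is an illustrative check, not the TI 83/84 procedure the exercise asks for; it uses the shortcut formula from the Formula Review, which divides by n, so a version that divides by n − 1 would give a slightly larger value. The variable names are assumptions of this sketch.

```python
from math import sqrt

# Table (a): class boundaries and frequencies for the grades.
classes = [((49.5, 59.5), 2), ((59.5, 69.5), 3), ((69.5, 79.5), 8),
           ((79.5, 89.5), 12), ((89.5, 99.5), 5)]

# Each class is represented by its midpoint m.
midpoints = [((low + high) / 2, f) for (low, high), f in classes]

n = sum(f for _, f in midpoints)
x_bar = sum(f * m for m, f in midpoints) / n        # estimated mean
sum_fm2 = sum(f * m ** 2 for m, f in midpoints)     # sum of f * m^2
s_x = sqrt(sum_fm2 / n - x_bar ** 2)                # shortcut formula, divides by n

print(f"n = {n}, mean = {x_bar}, s_x = {s_x:.2f}")
```

Swapping in the frequencies from tables (b) or (c) repeats the check for the other parts, although rounding the mean before squaring, as the printed solutions do, can shift the last digit.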


Homework 


Use the following information to answer the next nine exercises: The population parameters below describe the 
full-time equivalent number of students (FTES) each year at Lake Tahoe Community College from 1976-1977 
through 2004-2005. 


• μ = 1000 FTES 
• median = 1,014 FTES 
• σ = 474 FTES 
• first quartile = 528.5 FTES 
• third quartile = 1,447.5 FTES 
• n = 29 years 


Exercise: 


Problem: 


A sample of 11 years is taken. About how many are expected to have a FTES of 1014 or above? Explain 
how you determined your answer. 


Solution: 


The median value is the middle value in the ordered list of data values. The median value of a set of 11 will 
be the 6th number in order. Six years will have totals at or below the median. 


Exercise: 


Problem: 75% of all years have an FTES: 


a. at or below: 
b. at or above: 


Exercise: 


Problem: The population standard deviation = 


Solution: 
474 FTES 


Exercise: 


Problem: What percent of the FTES were from 528.5 to 1447.5? How do you know? 
Exercise: 

Problem: What is the IQR? What does the IQR represent? 

Solution: 

919 
Exercise: 

Problem: How many standard deviations away from the mean is the median? 


Additional Information: The population FTES for 2005-2006 through 2010-2011 was given in an updated 
report. The data are reported here. 


Year 2005-06 2006-07 2007-08 2008-09 2009-10 2010-11 
Total FTES 1,585 1,690 1,735 1,935 2,021 1,890 
Exercise: 
Problem: 


Calculate the mean, median, standard deviation, the first quartile, the third quartile and the IQR. Round to 
one decimal place. 


Solution: 


• mean = 1,809.3 
• median = 1,812.5 
• standard deviation = 151.2 
• first quartile = 1,690 
• third quartile = 1,935 
• IQR = 245 
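
These summary values can be reproduced with a short script. The sketch below assumes two conventions that the solution appears to follow: the six years are treated as a population, so the standard deviation divides by N, and the quartiles are the medians of the lower and upper halves of the ordered data.

```python
from statistics import mean, median, pstdev

ftes = [1585, 1690, 1735, 1935, 2021, 1890]  # 2005-06 through 2010-11

ordered = sorted(ftes)
half = len(ordered) // 2
lower, upper = ordered[:half], ordered[-half:]  # halves on either side of the median

q1, q3 = median(lower), median(upper)
summary = {
    "mean": round(mean(ftes), 1),
    "median": median(ftes),
    "standard deviation": round(pstdev(ftes), 1),  # population formula (divides by N)
    "first quartile": q1,
    "third quartile": q3,
    "IQR": q3 - q1,
}
for name, value in summary.items():
    print(f"{name} = {value}")
```

Other quartile conventions exist, and software that interpolates between data values may report slightly different quartiles for so small a data set.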


Exercise: 
Problem: 


Compare the IQR for the FTES for 1976-77 through 2004-2005 with the IQR for the FTES for 2005-2006 
through 2010-2011. Why do you suppose the IQRs are so different? 


Solution: 
Hint: Think about the number of years covered by each time period and what happened to higher education 
during those periods. 
Exercise: 
Problem: 
Three students were applying to the same graduate school. They came from schools with different grading 
systems. Which student had the best GPA when compared to other students at his school? Explain how you 
determined your answer. 


Student GPA School Average GPA School Standard Deviation 
Thuy 2.7 3.2 0.8 
Vichet 87 75 20 
Kamala 8.6 8 0.4 
Exercise: 
Problem: 


A music school has budgeted to purchase three musical instruments. They plan to purchase a piano costing 
$3,000, a guitar costing $550, and a drum set costing $600. The mean cost for a piano is $4,000 with a 
standard deviation of $2,500. The mean cost for a guitar is $500 with a standard deviation of $200. The 
mean cost for drums is $700 with a standard deviation of $100. Which cost is the lowest, when compared 
to other instruments of the same type? Which cost is the highest when compared to other instruments of the 
same type? Justify your answer. 


Solution: 


For pianos, the cost of the piano is 0.4 standard deviations BELOW the mean. For guitars, the cost of the 
guitar is 0.25 standard deviations ABOVE the mean. For drums, the cost of the drum set is 1.0 standard 
deviations BELOW the mean. Of the three, the drums cost the lowest in comparison to the cost of other 
instruments of the same type. The guitar costs the most in comparison to the cost of other instruments of 
the same type. 
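
The three comparisons can be computed together and then ranked. The sketch below is illustrative only; the dictionary name and layout are assumptions, and the figures are the prices, means, and standard deviations given in the problem.

```python
# (purchase price, mean price, standard deviation) in dollars, from the problem.
instruments = {
    "piano": (3000, 4000, 2500),
    "guitar": (550, 500, 200),
    "drums": (600, 700, 100),
}

z_scores = {name: (cost - mean) / sd for name, (cost, mean, sd) in instruments.items()}

lowest = min(z_scores, key=z_scores.get)    # most standard deviations below its mean
highest = max(z_scores, key=z_scores.get)   # most standard deviations above its mean

for name, z in z_scores.items():
    print(f"{name}: z = {z:+.2f}")
print("lowest relative cost:", lowest)
print("highest relative cost:", highest)
```

Ranking by z-score rather than by dollar amount is what lets a $3,000 piano count as relatively cheaper than a $550 guitar.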


Exercise: 
Problem: 
An elementary school class ran one mile with a mean of 11 minutes and a standard deviation of three 
minutes. Rachel, a student in the class, ran one mile in eight minutes. A junior high school class ran one 
mile with a mean of nine minutes and a standard deviation of two minutes. Kenji, a student in the class, ran 
one mile in 8.5 minutes. A high school class ran one mile with a mean of seven minutes and a standard 
deviation of four minutes. Nedda, a student in the class, ran one mile in eight minutes. 


a. Why is Kenji considered a better runner than Nedda, even though Nedda ran faster than he? 
b. Who is the fastest runner with respect to his or her class? Explain why. 


Exercise: 
Problem: 


The most obese countries in the world have obesity rates that range from 11.4% to 74.6%. This data is 
summarized in Table 14. 


Percent of population obese Number of countries 
11.4-20.45 29 
20.45-29.45 13 
29.45-38.45 4 
38.45-47.45 0 
47.45-56.45 2 
56.45-65.45 1 
65.45-74.45 0 
74.45-83.45 1 


What is the best estimate of the average obesity percentage for these countries? What is the standard 
deviation for the listed obesity rates? The United States has an average obesity rate of 33.9%. Is this rate 
above average or below? How “unusual” is the United States’ obesity rate compared to the average rate? 
Explain. 


Solution: 


• x̄ = 23.32 
• Using the TI 83/84, we obtain a standard deviation of: s_x = 12.95. 
• The obesity rate of the United States is 10.58 percentage points higher than the average obesity rate. 
• Since the standard deviation is 12.95, we see that 23.32 + 12.95 = 36.27 is the obesity percentage that 
is one standard deviation from the mean. The United States obesity rate is slightly less than one 
standard deviation from the mean. Therefore, we can assume that the United States, while 34% obese, 
does not have an unusually high percentage of obese people. 
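
The same figures can be obtained without a calculator. The following sketch assumes the class boundaries from the table in the problem, represents each class by its midpoint, and uses the sample formula with n − 1 in the denominator, which appears to be what the calculator result reflects.

```python
from math import sqrt

# (lower boundary, upper boundary, number of countries) from the table in the problem.
classes = [(11.4, 20.45, 29), (20.45, 29.45, 13), (29.45, 38.45, 4), (38.45, 47.45, 0),
           (47.45, 56.45, 2), (56.45, 65.45, 1), (65.45, 74.45, 0), (74.45, 83.45, 1)]

n = sum(f for _, _, f in classes)
midpoints = [((low + high) / 2, f) for low, high, f in classes]

x_bar = sum(f * m for m, f in midpoints) / n                              # estimated mean
s_x = sqrt(sum(f * (m - x_bar) ** 2 for m, f in midpoints) / (n - 1))     # sample formula

us_rate = 33.9
z_us = (us_rate - x_bar) / s_x  # standard deviations above the estimated mean

print(f"mean = {x_bar:.2f}, s_x = {s_x:.2f}, z for the United States = {z_us:.2f}")
```

A z-score between zero and one supports the conclusion above that the United States rate is above average but within one standard deviation of the mean.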


Exercise: 


Problem: [link] gives the percent of children under five considered to be underweight. 


Percent of underweight children Number of countries 
16-21.45 23 
21.45-26.9 4 
26.9-32.35 9 
32.35-37.8 7 
37.8-43.25 6 
43.25-48.7 1 


What is the best estimate for the mean percentage of underweight children? What is the standard deviation? 
Which interval(s) could be considered unusual? Explain. 
Bringing It Together 


Exercise: 


Problem: 


Twenty-five randomly selected students were asked the number of movies they watched the previous week. 
The results are as follows: 


# of movies Frequency 
0 5 
1 9 
2 6 
3 4 
4 1 


a. Find the sample mean x̄. 
b. Find the approximate sample standard deviation, s. 


Solution: 


a. 1.48 
b. 1.12 


Exercise: 
Problem: 


Forty randomly selected students were asked the number of pairs of sneakers they owned. Let X = the 
number of pairs of sneakers owned. The results are as follows: 


X Frequency 
1 2 
2 5 
3 8 
4 12 
5 12 
6 0 
7 1 


a. Find the sample mean x̄. 
b. Find the sample standard deviation, s. 
c. Construct a histogram of the data. 
d. Complete the columns of the chart. 
e. Find the first quartile. 
f. Find the median. 
g. Find the third quartile. 
h. What percent of the students owned at least five pairs? 
i. Find the 40th percentile. 
j. Find the 90th percentile. 
k. Construct a line graph of the data. 
l. Construct a stemplot of the data. 


Exercise: 


Problem: 


Following are the published weights (in pounds) of all of the team members of the San Francisco 49ers 
from a previous year. 


177; 205; 210; 210; 232; 205; 185; 185; 178; 210; 206; 212; 184; 174; 185; 242; 188; 212; 215; 247; 241; 
223; 220; 260; 245; 259; 278; 270; 280; 295; 275; 285; 290; 272; 273; 280; 285; 286; 200; 215; 185; 230; 
250; 241; 190; 260; 250; 302; 265; 290; 276; 228; 265 


a. Organize the data from smallest to largest value. 

b. Find the median. 

c. Find the first quartile. 

d. Find the third quartile. 

e. The middle 50% of the weights are from to 

f. If our population were all professional football players, would the above data be a sample of weights 
or the population of weights? Why? 

g. If our population included every team member who ever played for the San Francisco 49ers, would 
the above data be a sample of weights or the population of weights? Why? 

h. Assume the population was the San Francisco 49ers. Find: 


i. the population mean, μ. 
ii. the population standard deviation, σ. 
iii. the weight that is two standard deviations below the mean. 
iv. When Steve Young, quarterback, played football, he weighed 205 pounds. How many standard 
deviations above or below the mean was he? 


i. That same year, the mean weight for the Dallas Cowboys was 240.08 pounds with a standard 
deviation of 44.38 pounds. Emmitt Smith weighed in at 209 pounds. With respect to his team, who was 
lighter, Smith or Young? How did you determine your answer? 


Solution: 


a. 174; 177; 178; 184; 185; 185; 185; 185; 188; 190; 200; 205; 205; 206; 210; 210; 210; 212; 212; 215; 
215; 220; 223; 228; 230; 232; 241; 241; 242; 245; 247; 250; 250; 259; 260; 260; 265; 265; 270; 272; 
273; 275; 276; 278; 280; 280; 285; 285; 286; 290; 290; 295; 302 

b. 241 

c. 205.5 

d. 272.5 

e. 205.5, 272.5 

f. sample 

g. population 

h 


i. 236.34 
ii. 37.50 
iii. 161.34 
iv. 0.84 std. dev. below the mean 


i. Young 


Exercise: 


Problem: 


One hundred teachers attended a seminar on mathematical problem solving. The attitudes of a 
representative sample of 12 of the teachers were measured before and after the seminar. A positive number 
for change in attitude indicates that a teacher's attitude toward math became more positive. The 12 change 
scores are as follows: 


3; 8; −1; 2; 0; 5; −3; 1; −1; 6; 5; −2 


a. What is the mean change score? 

b. What is the standard deviation for this population? 

c. What is the median change score? 

d. Find the change score that is 2.2 standard deviations below the mean. 


Exercise: 


Problem: 


Refer to [link] to determine which of the following are true and which are false. Explain your solution to each 
part in complete sentences. 


[Figure: two graphs, labeled (a) and (b), each with a horizontal axis marked 1 through 5] 


a. The medians for both graphs are the same. 

b. We cannot determine if any of the means for both graphs is different. 

c. The standard deviation for graph b is larger than the standard deviation for graph a. 
d. We cannot determine if any of the third quartiles for both graphs is different. 


Solution: 


a. True 
b. True 
c. True 
d. False 


Exercise: 


Problem: 


In a recent issue of the IEEE Spectrum, 84 engineering conferences were announced. Four conferences 
lasted two days. Thirty-six lasted three days. Eighteen lasted four days. Nineteen lasted five days. Four 
lasted six days. One lasted seven days. One lasted eight days. One lasted nine days. Let X = the length (in 
days) of an engineering conference. 


a. Organize the data in a chart. 
b. Find the median, the first quartile, and the third quartile. 
c. Find the 65th percentile. 
d. Find the 10th percentile. 
e. The middle 50% of the conferences last from _______ days to _______ days. 
f. Calculate the sample mean of days of engineering conferences. 
g. Calculate the sample standard deviation of days of engineering conferences. 
h. Find the mode. 
i. If you were planning an engineering conference, which would you choose as the length of the 
conference: mean; median; or mode? Explain why you made that choice. 
j. Give two reasons why you think that three to five days seem to be popular lengths of engineering 
conferences. 


Exercise: 


Problem: 
A survey of enrollment at 35 community colleges across the United States yielded the following figures: 


6414; 1550; 2109; 9350; 21828; 4300; 5944; 5722; 2825; 2044; 5481; 5200; 5853; 2750; 10012; 6357; 
27000; 9414; 7681; 3200; 17500; 9200; 7380; 18314; 6557; 13713; 17768; 7493; 2771; 2861; 1263; 7285; 
28165; 5080; 11622 


a. Organize the data into a chart with five intervals of equal width. Label the two columns "Enrollment" 
and "Frequency." 

b. Construct a histogram of the data. 

c. If you were to build a new community college, which piece of information would be more valuable: 
the mode or the mean? 

d. Calculate the sample mean. 

e. Calculate the sample standard deviation. 

f. A school with an enrollment of 8000 would be how many standard deviations away from the mean? 


Solution: 

a. Enrollment Frequency 
1000-5000 10 
5000-10000 16 
10000-15000 3 
15000-20000 3 
20000-25000 1 
25000-30000 2 


b. Check student’s solution. 
c. mode 

d. 8628.74 

e. 6943.88 

f. -0.09 


Use the following information to answer the next two exercises. X = the number of days per week that 100 
clients use a particular exercise facility. 


X Frequency 
0 3 
1 12 
2 33 
3 28 
4 11 
5 9 
6 4 


Exercise: 


Problem: The 80th percentile is _____ 


a. 5 
b. 80 
c. 3 
d. 4 


Exercise: 


Problem: The number that is 1.5 standard deviations BELOW the mean is approximately 


a. 0.7 

b. 4.8 

c. —2.8 

d. Cannot be determined 


Solution: 


a 
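
The arithmetic behind this answer can be sketched as follows. The code is an illustration rather than part of the original solution: it treats the 100 clients as a sample, computes the mean and standard deviation from the frequency table, and then steps 1.5 standard deviations below the mean.

```python
from math import sqrt

# X = days per week, with the number of clients reporting each value.
freq = {0: 3, 1: 12, 2: 33, 3: 28, 4: 11, 5: 9, 6: 4}

n = sum(freq.values())
x_bar = sum(x * f for x, f in freq.items()) / n
s = sqrt(sum(f * (x - x_bar) ** 2 for x, f in freq.items()) / (n - 1))

value = x_bar - 1.5 * s  # 1.5 standard deviations below the mean
print(f"mean = {x_bar:.2f}, s = {s:.2f}, mean - 1.5s = {value:.1f}")
```

Dividing by N instead of n − 1 changes the standard deviation only slightly here, so the rounded answer stays the same either way.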
Exercise: 
Problem: 


Suppose that a publisher conducted a survey asking adult consumers the number of fiction paperback 
books they had purchased in the previous month. The results are summarized in the [link]. 


# of books Freq. Rel. Freq. 
0 18 
1 24 
2 24 
3 22 
4 15 
5 10 
7 5 
9 1 


a. Are there any outliers in the data? Use an appropriate numerical test involving the IQR to identify 
outliers, if any, and clearly state your conclusion. 

b. If a data value is identified as an outlier, what should be done about it? 

c. Are any data values further than two standard deviations away from the mean? In some situations, 
statisticians may use this criterion to identify data values that are unusual, compared to the other data 
values. (Note that this criterion is most appropriate to use for data that is mound-shaped and symmetric, 
rather than for skewed data.) 

d. Do parts a and c of this problem give the same answer? 

e. Examine the shape of the data. Which part, a or c, of this question gives a more appropriate result for 
this data? 

f. Based on the shape of the data which is the most appropriate measure of center for this data: mean, 
median or mode? 


Glossary 


Standard Deviation 
a number that is equal to the square root of the variance and measures how far data values are from their 
mean; notation: s for sample standard deviation and σ for population standard deviation. 


Variance 
mean of the squared deviations from the mean, or the square of the standard deviation; for a set of data, a 
deviation can be represented as x − x̄ where x is a value of the data and x̄ is the sample mean. The sample 
variance is equal to the sum of the squares of the deviations divided by the difference of the sample size 
and one. 


Introduction 

Meteor showers are rare, but the probability of them occurring can be calculated. (credit: Navicore/flickr) 


It is often necessary to "guess" about the outcome of an event in order to 
make a decision. Politicians study polls to guess their likelihood of winning 
an election. Teachers choose a particular course of study based on what they 
think students can comprehend. Doctors choose the treatments needed for 
various diseases based on their assessment of likely results. You may have 
visited a casino where people play games chosen because of the belief that 
the likelihood of winning is good. You may have chosen your course of 
study based on the probable availability of jobs. 


You have, more than likely, used probability. In fact, you probably have an 
intuitive sense of probability. Probability deals with the chance of an event 
occurring. Whenever you weigh the odds of whether or not to do your 
homework or to study for an exam, you are using probability. In this 
chapter, you will learn how to solve probability problems using a systematic 
approach. 


Terminology 


Probability is a measure that is associated with how certain we are of 
outcomes of a particular experiment or activity. An experiment is a 
planned operation carried out under controlled conditions. If the result is 
not predetermined, then the experiment is said to be a chance experiment. 
Flipping one fair coin twice is an example of an experiment. 


A result of an experiment is called an outcome. The sample space of an 
experiment is the set of all possible outcomes. Three ways to represent a 
sample space are: to list the possible outcomes, to create a tree diagram, or 
to create a Venn diagram. The uppercase letter S is used to denote the 
sample space. For example, if you flip one fair coin, S = {H, T} where H = 
heads and T = tails are the outcomes. 


An event is any combination of outcomes. Upper case letters like A and B 
represent events. For example, if the experiment is to flip one fair coin, 
event A might be getting at most one head. The probability of an event A is 
written P(A). 


The probability of any outcome is the long-term relative frequency of 
that outcome. Probabilities are between zero and one, inclusive (that is, 
zero and one and all numbers between these values). P(A) = 0 means the 
event A can never happen. P(A) = 1 means the event A always happens. 
P(A) = 0.5 means the event A is equally likely to occur or not to occur. For 
example, if you flip one fair coin repeatedly (from 20 to 2,000 to 20,000 
times) the relative frequency of heads approaches 0.5 (the probability of 
heads). 


Equally likely means that each outcome of an experiment occurs with 
equal probability. For example, if you toss a fair, six-sided die, each face 
(1, 2, 3, 4, 5, or 6) is as likely to occur as any other face. If you toss a fair 
coin, a Head (H) and a Tail (T) are equally likely to occur. If you randomly 
guess the answer to a true/false question on an exam, you are equally likely 
to select a correct answer or an incorrect answer. 


To calculate the probability of an event A when all outcomes in the 
sample space are equally likely, count the number of outcomes for event A 
and divide by the total number of outcomes in the sample space. For 
example, if you toss a fair dime and a fair nickel, the sample space is {HH, 
TH, HT, TT} where T = tails and H = heads. The sample space has four 
outcomes. Let A = getting one head. There are two outcomes that meet this 
condition {HT, TH}, so P(A) = 2/4 = 0.5. 
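
The counting rule for equally likely outcomes can be checked by listing the sample space in code. The following is a minimal Python sketch, not part of the original text; the names sample_space and event_a are illustrative, and "getting one head" is read as exactly one head.

```python
from fractions import Fraction
from itertools import product

# Sample space for tossing a dime and a nickel: every ordered pair of H/T.
sample_space = ["".join(pair) for pair in product("HT", repeat=2)]

# Event A = getting exactly one head (assumed reading of "one head").
event_a = [outcome for outcome in sample_space if outcome.count("H") == 1]

# With equally likely outcomes, P(A) = (outcomes in A) / (outcomes in S).
p_a = Fraction(len(event_a), len(sample_space))

print(sample_space)  # ['HH', 'HT', 'TH', 'TT']
print(event_a)       # ['HT', 'TH']
print(p_a)           # 1/2
```

Using Fraction keeps the answer exact, which makes it easy to compare with a hand calculation such as 2/4.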

Suppose you roll one fair six-sided die, with the numbers {1, 2, 3, 4, 5, 6} 
on its faces. Let event E = rolling a number that is at least five. There are 
two outcomes {5, 6}. P(E) = 2/6. If you were to roll the die only a few times, 
you would not be surprised if your observed results did not match the 
probability. If you were to roll the die a very large number of times, you 
would expect that, overall, 2/6 of the rolls would result in an outcome of "at 
least five". You would not expect exactly 2/6. The long-term relative 
frequency of obtaining this result would approach the theoretical probability 
of 2/6 as the number of repetitions grows larger and larger. 


This important characteristic of probability experiments is known as the 
law of large numbers which states that as the number of repetitions of an 
experiment is increased, the relative frequency obtained in the experiment 
tends to become closer and closer to the theoretical probability. Even 
though the outcomes do not happen according to any set pattern or order, 
overall, the long-term observed relative frequency will approach the 
theoretical probability. (The word empirical is often used instead of the 
word observed.) 
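
The law of large numbers can be observed with a short simulation. The sketch below is an illustration only: it uses Python's pseudo-random number generator in place of a physical die, and the function name relative_frequency, the seed, and the chosen numbers of rolls are assumptions made for this example.

```python
import random

def relative_frequency(num_rolls, seed=0):
    """Simulate num_rolls rolls of a fair six-sided die and return the
    relative frequency of the event E = 'at least five'."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(num_rolls) if rng.randint(1, 6) >= 5)
    return hits / num_rolls

theoretical = 2 / 6
for n in (20, 2_000, 200_000):
    freq = relative_frequency(n)
    # The gap between observed and theoretical values tends to shrink as n grows.
    print(n, round(freq, 4), "difference:", round(abs(freq - theoretical), 4))
```

For a small number of rolls the observed value can be far from 2/6; for a very large number of rolls it is typically very close.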


It is important to realize that in many situations, the outcomes are not 
equally likely. A coin or die may be unfair, or biased. Two math professors 
in Europe had their statistics students test the Belgian one Euro coin and 
discovered that in 250 trials, a head was obtained 56% of the time and a tail 
was obtained 44% of the time. The data seem to show that the coin is not a 
fair coin; more repetitions would be helpful to draw a more accurate 
conclusion about such bias. Some dice may be biased. Look at the dice in a 
game you have at home; the spots on each face are usually small holes 
carved out and then painted to make the spots visible. Your dice may or 
may not be biased; it is possible that the outcomes may be affected by the 
slight weight differences due to the different numbers of holes in the faces. 


Gambling casinos make a lot of money depending on outcomes from rolling 
dice, so casino dice are made differently to eliminate bias. Casino dice have 
flat faces; the holes are completely filled with paint having the same density 
as the material that the dice are made out of so that each face is equally 
likely to occur. Later we will learn techniques to use to work with 
probabilities for events that are not equally likely. 


"U" Event: The Union 

An outcome is in the event A U B if the outcome is in A or is in B or is in 
both A and B. For example, let A = {1, 2, 3, 4, 5} and B = {4, 5, 6, 7, 8}. Then 
A U B = {1, 2, 3, 4, 5, 6, 7, 8}. Notice that 4 and 5 are NOT listed twice. 


"∩" Event: The Intersection 

An outcome is in the event A ∩ B if the outcome is in both A and B at the 
same time. For example, let A and B be {1, 2, 3, 4, 5} and {4, 5, 6, 7, 8}, 
respectively. Then A ∩ B = {4, 5}. 


The complement of event A is denoted A' (read "A prime"). A' consists of 
all outcomes that are NOT in A. Notice that P(A) + P(A') = 1. For example, 
let S = {1, 2, 3, 4, 5, 6} and let A = {1, 2, 3, 4}. Then, A' = {5, 6}. P(A) = 4/6, 
P(A') = 2/6, and P(A) + P(A') = 4/6 + 2/6 = 1. 
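
Union, intersection, and complement correspond directly to set operations. The following Python sketch is illustrative only. It reuses the sets A and B from the union and intersection examples; the sample space S = {1, ..., 8} is an assumption added here so that a complement can be formed, and the helper prob is a hypothetical name.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6, 7, 8}   # assumed sample space for this illustration
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7, 8}

union = A | B          # outcomes in A or in B or in both
intersection = A & B   # outcomes in both A and B
complement_a = S - A   # outcomes in S that are NOT in A

def prob(event, space):
    """Probability of an event when all outcomes in space are equally likely."""
    return Fraction(len(event), len(space))

print(sorted(union))          # [1, 2, 3, 4, 5, 6, 7, 8]
print(sorted(intersection))   # [4, 5]
print(sorted(complement_a))   # [6, 7, 8]
print(prob(A, S) + prob(complement_a, S))  # 1
```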


The conditional probability of A given B is written P(A|B). P(A|B) is the 
probability that event A will occur given that the event B has already 
occurred. A conditional reduces the sample space. We calculate the 
probability of A from the reduced sample space B. The formula to calculate 
P(A|B) is 

P(A|B) = P(A ∩ B) / P(B) 

where P(B) is greater than zero. 


For example, suppose we toss one fair, six-sided die. The sample space S = 
{1, 2, 3, 4, 5, 6}. Let A = face is 2 or 3 and B = face is even (2, 4, 6). To 
calculate P(A|B), we count the number of outcomes 2 or 3 in the sample 
space B= {2, 4, 6}. Then we divide that by the number of outcomes B 
(rather than S). 


We get the same result by using the formula. Remember that S has six 
outcomes. 


P(A|B) = P(A ∩ B) / P(B) 
       = [(the number of outcomes that are 2 or 3 and even in S) / 6] / [(the number of outcomes that are even in S) / 6] 
       = (1/6) / (3/6) = 1/3 
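
Both ways of finding P(A|B) in the die example, counting inside the reduced sample space and applying the formula, can be compared in a few lines. This is a sketch for checking the arithmetic, with illustrative variable names, and not part of the original text.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # one roll of a fair six-sided die
A = {2, 3}               # face is 2 or 3
B = {2, 4, 6}            # face is even

# Method 1: treat B as the reduced sample space and count.
p_reduced = Fraction(len(A & B), len(B))

# Method 2: use P(A|B) = P(A ∩ B) / P(B), with both probabilities measured in S.
p_a_and_b = Fraction(len(A & B), len(S))
p_b = Fraction(len(B), len(S))
p_formula = p_a_and_b / p_b

print(p_reduced, p_formula)  # 1/3 1/3
```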


Odds 

The odds of an event present the probability as a ratio of success to failure. 
This is common in various gambling formats. Mathematically, the odds of 
an event can be defined as: 

Equation: 


P(A) / (1 - P(A)) 


where P(A) is the probability of success and 1 - P(A) is the 
probability of failure. Odds are always quoted as "numerator to 
denominator," e.g. 2 to 1. Here the probability of winning is twice that of 
losing; thus, the probability of winning is 2/3, or about 0.67. A probability of 
winning of 0.60 would generate odds in favor of winning of 3 to 2. While the 
calculation of odds can be useful in gambling venues in determining payoff 
amounts, it is not helpful for understanding probability or statistical theory. 
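
The conversion between a probability and its odds can be sketched as follows. The function names odds_in_favor and probability_from_odds are hypothetical helpers written for this illustration; they assume the probability is supplied as an exact fraction.

```python
from fractions import Fraction

def odds_in_favor(p):
    """Return the odds P(A) / (1 - P(A)) as a Fraction (success to failure).
    Pass p as a Fraction or a string such as '3/5' to keep the result exact."""
    p = Fraction(p)
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)

def probability_from_odds(successes, failures):
    """Convert odds of 'successes to failures' back to a probability."""
    return Fraction(successes, successes + failures)

print(odds_in_favor(Fraction(3, 5)))   # 3/2, that is, odds of 3 to 2
print(probability_from_odds(2, 1))     # 2/3, about 0.67
```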


Understanding Terminology and Symbols 

It is important to read each problem carefully to think about and understand 
what the events are. Understanding the wording is the first very important 
step in solving probability problems. Reread the problem several times if 
necessary. Clearly identify the event of interest. Determine whether there is 
a condition stated in the wording that would indicate that the probability is 
conditional; carefully identify the condition, if any. 


Example: 
Exercise: 


Problem: 


The sample space S is the whole numbers starting at one and less than 
20. 


a. S = ____ 

Let event A = the even numbers and event B = numbers greater 
than 13. 

b. A = ____, B = ____ 

c. P(A) = ____, P(B) = ____ 

d. A ∩ B = ____, A U B = ____ 

e. P(A ∩ B) = ____, P(A U B) = ____ 

f. A' = ____, P(A') = ____ 

g. P(A) + P(A') = ____ 

h. P(A|B) = ____, P(B|A) = ____; are the probabilities equal? 


Solution: 


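a. S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19} 

b. A = {2, 4, 6, 8, 10, 12, 14, 16, 18}, B = {14, 15, 16, 17, 18, 19} 

c. P(A) = 9/19, P(B) = 6/19 

d. A ∩ B = {14, 16, 18}, A U B = {2, 4, 6, 8, 10, 12, 14, 15, 16, 17, 
18, 19} 

e. P(A ∩ B) = 3/19, P(A U B) = 12/19 

f. A' = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19}; P(A') = 10/19 

g. P(A) + P(A') = 9/19 + 10/19 = 1 

h. P(A|B) = P(A ∩ B) / P(B) = 3/6, P(B|A) = P(A ∩ B) / P(A) = 3/9, No 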


Note: 
Try It 


Exercise: 


Problem: 
The sample space S is all the ordered pairs of two whole numbers, the 
first from one to three and the second from one to four (Example: (1, 
4)). 

a. S = ____ 

Let event A = the sum is even and event B = the first number is 
prime. 

b. A = ____, B = ____ 
c. P(A) = ____, P(B) = ____ 
d. A ∩ B = ____, A U B = ____ 
e. P(A ∩ B) = ____, P(A U B) = ____ 
f. B' = ____, P(B') = ____ 
g. P(A) + P(A') = ____ 
h. P(A|B) = ____, P(B|A) = ____; are the probabilities equal? 


Solution: 


a. S = {(1,1), (1,2), (1,3), (1,4), (2,1), (2,2), (2,3), (2,4), (3,1), (3,2), 
(3,3), (3,4)} 
b. A = {(1,1), (1,3), (2,2), (2,4), (3,1), (3,3)} 
B = {(2,1), (2,2), (2,3), (2,4), (3,1), (3,2), (3,3), (3,4)} 
c. P(A) = 1/2, P(B) = 2/3 
d. A ∩ B = {(2,2), (2,4), (3,1), (3,3)} 
A U B = {(1,1), (1,3), (2,1), (2,2), (2,3), (2,4), (3,1), (3,2), (3,3), 
(3,4)} 
e. P(A ∩ B) = 1/3, P(A U B) = 5/6 
f. B' = {(1,1), (1,2), (1,3), (1,4)}, P(B') = 1/3 
g. P(A) + P(A') = 1 
h. P(A|B) = P(A ∩ B) / P(B) = 1/2, P(B|A) = P(A ∩ B) / P(A) = 2/3, No. 
Example: 
Exercise: 
Problem: 


A fair, six-sided die is rolled. Describe the sample space S, identify 
each of the following events with a subset of S and compute its 
probability (an outcome is the number of dots that show up). 


a. Event T = the outcome is two. 

b. Event A = the outcome is an even number. 
c. Event B = the outcome is less than four. 
d. The complement of A. 

e. A|B 

f. B|A 

g. A ∩ B 

h. A U B 

i. A U B' 

j. Event N = the outcome is a prime number. 
k. Event J = the outcome is seven. 


Solution: 


a. T = {2}, P(T) = 1/6 

b. A = {2, 4, 6}, P(A) = 1/2 

c. B = {1, 2, 3}, P(B) = 1/2 

d. A' = {1, 3, 5}, P(A') = 1/2 

e. A|B = {2}, P(A|B) = 1/3 

f. B|A = {2}, P(B|A) = 1/3 

g. A ∩ B = {2}, P(A ∩ B) = 1/6 

h. A U B = {1, 2, 3, 4, 6}, P(A U B) = 5/6 

i. A U B' = {2, 4, 5, 6}, P(A U B') = 2/3 

j. N = {2, 3, 5}, P(N) = 1/2 

k. A six-sided die does not have seven dots. P(7) = 0. 


Example: 
The following table describes the distribution of a random sample S of 100 
individuals, organized by gender and whether they are right- or left-handed. 


            Right-handed    Left-handed 
Males            43              9 
Females          44              4 
Exercise: 
Problem: 


Let’s denote the events M = the subject is male, F = the subject is 
female, R = the subject is right-handed, L = the subject is left-handed. 
Compute the following probabilities: 


a. P(M) 
b. P(F) 
c. P(R) 
d. P(L) 
e. P(M ∩ R) 
f. P(F ∩ L) 
g. P(M U F) 
h. P(M U R) 
i. P(F U L) 
j. P(M') 
k. P(R|M) 
l. P(F|L) 
m. P(L|F) 


Solution: 


a. P(M) = 0.52 

b. P(F) = 0.48 

c. P(R) = 0.87 

d. P(L) = 0.13 

e. P(M ∩ R) = 0.43 

f. P(F ∩ L) = 0.04 

g. P(M U F) = 1 

h. P(M U R) = 0.96 

i. P(F U L) = 0.57 

j. P(M') = 0.48 

k. P(R|M) = 0.8269 (rounded to four decimal places) 
l. P(F|L) = 0.3077 (rounded to four decimal places) 
m. P(L|F) = 0.0833 
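
The table probabilities can be reproduced by storing the four cell counts and summing the cells that satisfy each event. The sketch below is an illustration, not the authors' method; the dictionary layout and the helper prob are assumptions made for this example.

```python
# Cell counts from the handedness table, keyed by (gender, hand).
counts = {
    ("M", "R"): 43, ("M", "L"): 9,
    ("F", "R"): 44, ("F", "L"): 4,
}
total = sum(counts.values())  # 100 individuals

def prob(condition):
    """Relative frequency of the cells for which condition(gender, hand) is true."""
    return sum(n for (g, h), n in counts.items() if condition(g, h)) / total

p_m = prob(lambda g, h: g == "M")                      # P(M)
p_m_and_r = prob(lambda g, h: g == "M" and h == "R")   # P(M ∩ R)
p_m_or_r = prob(lambda g, h: g == "M" or h == "R")     # P(M U R)
p_r_given_m = p_m_and_r / p_m                          # P(R|M)

print(p_m, p_m_and_r, p_m_or_r, round(p_r_given_m, 4))
```

The conditional probability divides by P(M) rather than by the whole sample, which mirrors the idea that a condition reduces the sample space.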


Chapter Review 


In this module we learned the basic terminology of probability. The set of 
all possible outcomes of an experiment is called the sample space. Events 
are subsets of the sample space, and they are assigned a probability that is a 
number between zero and one, inclusive. 


Formula Review 
A and B are events 
P(S) = 1 where S is the sample space 


0 ≤ P(A) ≤ 1 


P(A|B) = P(A ∩ B) / P(B) 


Exercise: 


Problem: 


In a particular college class, there are male and female students. Some 
students have long hair and some students have short hair. Write the 
symbols for the probabilities of the events for parts a through j. (Note 
that you cannot find numerical answers here. You were not given 
enough information to find any probability values yet; concentrate on 
understanding the symbols.) 


* Let F be the event that a student is female. 
* Let M be the event that a student is male. 
* Let S be the event that a student has short hair. 
* Let L be the event that a student has long hair. 


a. The probability that a student does not have long hair. 

b. The probability that a student is male or has short hair. 

c. The probability that a student is a female and has long hair. 

d. The probability that a student is male, given that the student has 
long hair. 

e. The probability that a student has long hair, given that the student 
is male. 


f. Of all the female students, the probability that a student has short 
hair. 

g. Of all students with long hair, the probability that a student is 
female. 

h. The probability that a student is female or has long hair. 

i. The probability that a randomly selected student is a male student 
with short hair. 

j. The probability that a student is female. 


Solution: 


a. P(L') = P(S) 
b. P(M U S) 

c. P(F ∩ L) 

d. P(M|L) 

e. P(L|M) 

f. P(S|F) 

g. P(F|L) 

h. P(F U L) 

i. P(M ∩ S) 

j. P(F) 


Use the following information to answer the next four exercises. A box is 
filled with several party favors. It contains 12 hats, 15 noisemakers, ten 
finger traps, and five bags of confetti. 

Let H = the event of getting a hat. 

Let N = the event of getting a noisemaker. 

Let F = the event of getting a finger trap. 

Let C = the event of getting a bag of confetti. 

Exercise: 


Problem: Find P(H). 


Exercise: 


Problem: Find P(N). 
Solution: 


P(N) = 15/42 = 5/14 = 0.36 


Exercise: 


Problem:Find P(F). 


Exercise: 


Problem: Find P(C). 


Solution: 


P(C) = 5/42 = 0.12 


Use the following information to answer the next six exercises. A jar of 150 
jelly beans contains 22 red jelly beans, 38 yellow, 20 green, 28 purple, 26 
blue, and the rest are orange. 

Let B = the event of getting a blue jelly bean 

Let G = the event of getting a green jelly bean. 

Let O = the event of getting an orange jelly bean. 

Let P = the event of getting a purple jelly bean. 

Let R = the event of getting a red jelly bean. 

Let Y = the event of getting a yellow jelly bean. 

Exercise: 


Problem:Find P(B). 


Exercise: 


Problem: Find P(G). 


Solution: 


P(G) = 20/150 = 2/15 = 0.13 
Exercise: 
Problem:Find P(P). 
Exercise: 


Problem: Find P(R). 
Solution: 


P(R) = 22/150 = 11/75 = 0.15 


Exercise: 


Problem: Find P(Y). 


Exercise: 


Problem:Find P(O). 


Solution: 


P(O) = (150 - 22 - 38 - 20 - 28 - 26)/150 = 16/150 = 8/75 = 0.11 


Use the following information to answer the next six exercises. There are 23 
countries in North America, 12 countries in South America, 47 countries in 
Europe, 44 countries in Asia, 54 countries in Africa, and 14 in Oceania 
(Pacific Ocean region). 

Let A = the event that a country is in Asia. 

Let E = the event that a country is in Europe. 

Let F = the event that a country is in Africa. 

Let N = the event that a country is in North America. 


Let O = the event that a country is in Oceania. 
Let S = the event that a country is in South America. 
Exercise: 


Problem: Find P(A). 
Exercise: 


Problem:Find P(E). 


Solution: 


P(E) = 47/194 = 0.24 


Exercise: 


Problem: Find P(F). 


Exercise: 


Problem: Find P(N). 
Solution: 


P(N) = 23/194 = 0.12 


Exercise: 


Problem:Find P(O). 


Exercise: 


Problem: Find P(S). 
Solution: 


P(S) = 12/194 = 6/97 = 0.06 


Exercise: 
Problem: 
What is the probability of drawing a red card in a standard deck of 52 
cards? 
Exercise: 
Problem: 


What is the probability of drawing a club in a standard deck of 52 
cards? 


Solution: 


13/52 = 1/4 = 0.25 
Exercise: 
Problem: 
What is the probability of rolling an even number of dots with a fair, 
six-sided die numbered one through six? 
Exercise: 
Problem: 


What is the probability of rolling a prime number of dots with a fair, 
six-sided die numbered one through six? 


Solution: 


3/6 = 1/2 = 0.5 


Use the following information to answer the next two exercises. You see a 
game at a local fair. You have to throw a dart at a color wheel. Each section 
on the color wheel is equal in area. 


Let B = the event of landing on blue. 
Let R = the event of landing on red. 
Let G = the event of landing on green. 
Let Y = the event of landing on yellow. 
Exercise: 


Problem: If you land on Y, you get the biggest prize. Find P(Y). 


Exercise: 


Problem: If you land on red, you don’t get a prize. What is P(R)? 


Solution: 


Use the following information to answer the next ten exercises. On a 
baseball team, there are infielders and outfielders. Some players are great 
hitters, and some players are not great hitters. 

Let I = the event that a player is an infielder. 

Let O = the event that a player is an outfielder. 

Let H = the event that a player is a great hitter. 

Let N = the event that a player is not a great hitter. 


Exercise: 


Problem: 


Write the symbols for the probability that a player is not an outfielder. 
Exercise: 
Problem: 


Write the symbols for the probability that a player is an outfielder or is 
a great hitter. 


Solution: 


P(O U H) 
Exercise: 
Problem: 
Write the symbols for the probability that a player is an infielder and is 
not a great hitter. 
Exercise: 
Problem: 


Write the symbols for the probability that a player is a great hitter, 
given that the player is an infielder. 


Solution: 
P(H|I) 
Exercise: 


Problem: 


Write the symbols for the probability that a player is an infielder, given 
that the player is a great hitter. 


Exercise: 


Problem: 


Write the symbols for the probability that of all the outfielders, a 
player is not a great hitter. 


Solution: 
P(N|O) 
Exercise: 


Problem: 


Write the symbols for the probability that of all the great hitters, a 
player is an outfielder. 

Exercise: 
Problem: 


Write the symbols for the probability that a player is an infielder or is 
not a great hitter. 


Solution: 


P(I U N) 
Exercise: 


Problem: 


Write the symbols for the probability that a player is an outfielder and 
is a great hitter. 


Exercise: 


Problem: 
Write the symbols for the probability that a player is an infielder. 


Solution: 


P(I) 


Exercise: 


Problem: What is the word for the set of all possible outcomes? 


Exercise: 


Problem: What is conditional probability? 
Solution: 
The likelihood that an event will occur given that another event has 
already occurred. 

Exercise: 
Problem: 
A shelf holds 12 books. Eight are fiction and the rest are nonfiction. 
Each is a different book with a unique title. The fiction books are 
numbered one to eight. The nonfiction books are numbered one to 
four. Randomly select one book. 
Let F = event that book is fiction 


Let N = event that book is nonfiction 
What is the sample space? 


Exercise: 


Problem: 
What is the sum of the probabilities of an event and its complement? 
Solution: 


1 


Use the following information to answer the next two exercises. You are 


rolling a fair, six-sided number cube. Let E = the event that it lands on an 
even number. Let M = the event that it lands on a multiple of three. 
Exercise: 


Problem: What does P(E|M) mean in words? 


Exercise: 


Problem: What does P(E U M) mean in words? 
Solution: 


the probability of landing on an even number or a multiple of three 


Homework 


Exercise: 


Problem: 
[Figure: bar graph showing the sample size, percent approve, and percent 
disapprove for each group: Total, 18-34, 35-44, 45-54, 55-64, 65+, Male, Female] 


The graph in [link] displays the sample sizes and percentages of people 
in different age and gender groups who were polled concerning their 
approval of Mayor Ford’s actions in office. The total number in the 
sample of all the age groups is 1,045. 


a. Define three events in the graph. 
b. Describe in words what the entry 40 means. 
c. Describe in words the complement of the entry in question 2. 


d. Describe in words what the entry 30 means. 

e. Out of the males and females, what percent are males? 

f. Out of the females, what percent disapprove of Mayor Ford? 

g. Out of all the age groups, what percent approve of Mayor Ford? 
h. Find P(Approve|Male). 

i. Out of the age groups, what percent are more than 44 years old? 
j. Find P(Approve|Age < 35). 


Exercise: 


Problem: 


Explain what is wrong with the following statements. Use complete 
sentences. 


a. If there is a 60% chance of rain on Saturday and a 70% chance of 
rain on Sunday, then there is a 130% chance of rain over the 
weekend. 

b. The probability that a baseball player hits a home run is greater 
than the probability that he gets a successful hit. 


Solution: 


a. You cannot find the chance of rain over the weekend by adding the 
two probabilities. To find it, you would need the probability that 
it rains on both days, which is not in the information given. In 
addition, a probability is never greater than 100%. 

b. A home run by definition is a successful hit, so he has to have at 
least as many successful hits as home runs. 


Glossary 


Conditional Probability 
the likelihood that an event will occur given that another event has 
already occurred 


Equally Likely 
Each outcome of an experiment has the same probability. 


Event 
a subset of the set of all outcomes of an experiment; the set of all 
outcomes of an experiment is called a sample space and is usually 
denoted by S. An event is an arbitrary subset in S. It can contain one 
outcome, two outcomes, no outcomes (empty subset), the entire 
sample space, and the like. Standard notations for events are capital 
letters such as A, B, C, and so on. 


Experiment 
a planned activity carried out under controlled conditions 


Outcome 
a particular result of an experiment 


Probability 
a number between zero and one, inclusive, that gives the likelihood 
that a specific event will occur; the foundation of statistics is given by 
the following 3 axioms (by A.N. Kolmogorov, 1930’s): Let S denote 
the sample space and A and B are two events in S. Then: 


* 0 ≤ P(A) ≤ 1 

* If A and B are any two mutually exclusive events, then P(A U B) 
= P(A) + P(B). 

* P(S) = 1 


Sample Space 
the set of all possible outcomes of an experiment 


The Intersection: the ∩ Event 
An outcome is in the event A ∩ B if the outcome is in both A and B at the 
same time. 


The Complement Event 
The complement of event A consists of all outcomes that are NOT in 
A. 


The Conditional Probability of A | B 
P(A|B) is the probability that event A will occur given that the event B 
has already occurred. 


The Union: the U Event 
An outcome is in the event A U B if the outcome is in A or is in B or is 
in both A and B. 


Independent and Mutually Exclusive Events 


Independent and mutually exclusive do not mean the same thing. 


Independent Events 


Two events are independent if one of the following is true: 


* P(A|B) = P(A) 
* P(B|A) = P(B) 
* P(A ∩ B) = P(A)P(B) 


Two events A and B are independent if the knowledge that one occurred does not affect the chance the other 
occurs. For example, the outcomes of two rolls of a fair die are independent events. The outcome of the first roll 
does not change the probability for the outcome of the second roll. To show two events are independent, you must 
show only one of the above conditions. If two events are NOT independent, then we say that they are dependent. 
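
The three independence conditions can be checked by enumerating the outcomes of two rolls of a fair die. The sketch below is illustrative only; the particular events A (first roll is even) and B (second roll is at least five) are assumptions chosen for this example and do not come from the text.

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))   # all 36 (first roll, second roll) pairs
A = {s for s in S if s[0] % 2 == 0}       # assumed event: first roll is even
B = {s for s in S if s[1] >= 5}           # assumed event: second roll is at least five

def prob(event, space=S):
    """Probability of an event when all outcomes in space are equally likely."""
    return Fraction(len(event), len(space))

p_a, p_b, p_ab = prob(A), prob(B), prob(A & B)

print(p_ab / p_b == p_a)    # checks P(A|B) = P(A); prints True
print(p_ab / p_a == p_b)    # checks P(B|A) = P(B); prints True
print(p_ab == p_a * p_b)    # checks P(A ∩ B) = P(A)P(B); prints True
```

Any one of the three checks is enough to establish independence; all three are printed here only to show that they agree.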


Sampling may be done with replacement or without replacement. 


* With replacement: If each member of a population is replaced after it is picked, then that member has the 
possibility of being chosen more than once. When sampling is done with replacement, then events are 
considered to be independent, meaning the result of the first pick will not change the probabilities for the 
second pick. 

* Without replacement: When sampling is done without replacement, each member of a population may be 
chosen only once. In this case, the probabilities for the second pick are affected by the result of the first pick. 
The events are considered to be dependent or not independent. 
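
The effect of replacement on the second pick can be seen by listing every ordered pair of cards under each sampling scheme. The following Python sketch is an illustration under stated assumptions: the card labels (rank followed by a suit letter) and the function name are made up for this example, and the event examined is "second card is a queen, given that the first card is a queen".

```python
from fractions import Fraction
from itertools import permutations, product

ranks = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["C", "D", "H", "S"]
deck = [rank + suit for rank in ranks for suit in suits]   # 52 cards

def p_second_queen_given_first_queen(draws):
    """P(second card is a queen | first card is a queen) over equally likely draws."""
    first_queen = [d for d in draws if d[0].startswith("Q")]
    both_queens = [d for d in first_queen if d[1].startswith("Q")]
    return Fraction(len(both_queens), len(first_queen))

with_replacement = list(product(deck, repeat=2))     # the same card may repeat
without_replacement = list(permutations(deck, 2))    # the same card cannot repeat

print(p_second_queen_given_first_queen(with_replacement))     # 1/13
print(p_second_queen_given_first_queen(without_replacement))  # 1/17
```

With replacement the conditional probability equals the unconditional probability 4/52 = 1/13, so the picks are independent; without replacement it changes to 3/51 = 1/17, so the picks are dependent.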


If it is not known whether A and B are independent or dependent, assume they are dependent until you can show 
otherwise. 


Example: 

You have a fair, well-shuffled deck of 52 cards. It consists of four suits. The suits are clubs, diamonds, hearts and 
spades. There are 13 cards in each suit consisting of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, J (jack), Q (queen), K (king) of that 
suit. 

a. Sampling with replacement: 

Suppose you pick three cards with replacement. The first card you pick out of the 52 cards is the Q of spades. You 
put this card back, reshuffle the cards and pick a second card from the 52-card deck. It is the ten of clubs. You put 
this card back, reshuffle the cards and pick a third card from the 52-card deck. This time, the card is the Q of 
spades again. Your picks are {Q of spades, ten of clubs, Q of spades}. You have picked the Q of spades twice. 
You pick each card from the 52-card deck. 

b. Sampling without replacement: 

Suppose you pick three cards without replacement. The first card you pick out of the 52 cards is the K of hearts. 
You put this card aside and pick the second card from the 51 cards remaining in the deck. It is the three of 
diamonds. You put this card aside and pick the third card from the remaining 50 cards in the deck. The third card 
is the J of spades. Your picks are {K of hearts, three of diamonds, J of spades}. Because you have picked the 
cards without replacement, you cannot pick the same card twice. The probability of picking the three of diamonds 
is called a conditional probability because it is conditioned on what was picked first. This is true also of the 
probability of picking the J of spades. The probability of picking the J of spades is actually conditioned on both 
the previous picks. 


Note: 
Try It 
Exercise: 


Problem: 


You have a fair, well-shuffled deck of 52 cards. It consists of four suits. The suits are clubs, diamonds, hearts 
and spades. There are 13 cards in each suit consisting of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, J (jack), Q (queen), K 
(king) of that suit. Three cards are picked at random. 


a. Suppose you know that the picked cards are Q of spades, K of hearts and Q of spades. Can you decide if 
the sampling was with or without replacement? 

b. Suppose you know that the picked cards are Q of spades, K of hearts, and J of spades. Can you decide if 
the sampling was with or without replacement? 


Solution: 


a. With replacement 
b. No 


Example: 
Exercise: 


Problem: 


You have a fair, well-shuffled deck of 52 cards. It consists of four suits. The suits are clubs, diamonds, hearts, 
and spades. There are 13 cards in each suit consisting of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, J (jack), Q (queen), and K 
(king) of that suit. S = spades, H = Hearts, D = Diamonds, C = Clubs. 


a. Suppose you pick four cards, but do not put any cards back into the deck. Your cards are QS, 1D, 1C, 
QD. 

b. Suppose you pick four cards and put each card back before you pick the next card. Your cards are KH, 
7D, 6D, KH. 


Which of a. or b. did you sample with replacement and which did you sample without replacement? 


Solution: 


a. Without replacement; b. With replacement 


Note: 
Try It 
Exercise: 


Problem: 


You have a fair, well-shuffled deck of 52 cards. It consists of four suits. The suits are clubs, diamonds, hearts, 
and spades. There are 13 cards in each suit consisting of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, J (jack), Q (queen), and K 
(king) of that suit. S = spades, H = Hearts, D = Diamonds, C = Clubs. Suppose that you sample four cards 
without replacement. Which of the following outcomes are possible? Answer the same question for sampling 
with replacement. 


a. QS, 1D, 1C, QD 
b. KH, 7D, 6D, KH 
c. QS, 7D, 6D, KS 


Solution: 
without replacement: a. Possible; b. Impossible; c. Possible 


with replacement: a. Possible; b. Possible; c. Possible 


Mutually Exclusive Events 


A and B are mutually exclusive events if they cannot occur at the same time. Said another way, if A occurred then 
B cannot occur, and vice versa. This means that A and B do not share any outcomes and P(A ∩ B) = 0. 


For example, suppose the sample space S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Let A = {1, 2, 3, 4, 5}, B = {4, 5, 6, 7, 8}, 
and C = {7, 9}. A ∩ B = {4, 5}. P(A ∩ B) = 2/10 and is not equal to zero. Therefore, A and B are not mutually 
exclusive. A and C do not have any numbers in common so P(A ∩ C) = 0. Therefore, A and C are mutually 
exclusive. 
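
The same check can be written with sets: two events are mutually exclusive exactly when they share no outcomes. This is a minimal sketch using the sets from the example; the helper name mutually_exclusive is illustrative.

```python
from fractions import Fraction

S = set(range(1, 11))       # {1, 2, ..., 10}
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7, 8}
C = {7, 9}

def mutually_exclusive(x, y):
    """True when the two events share no outcomes, so P(x ∩ y) = 0."""
    return x.isdisjoint(y)

print(Fraction(len(A & B), len(S)))   # 1/5, the same value as 2/10
print(mutually_exclusive(A, B))       # False
print(mutually_exclusive(A, C))       # True
```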


If it is not known whether A and B are mutually exclusive, assume they are not until you can show otherwise. 
The following examples illustrate these definitions and terms. 


Example: 

Flip two fair coins. (This is an experiment.) 

The sample space is {HH, HT, TH, TT} where T = tails and H = heads. The outcomes are HH, HT, TH, and TT. 
The outcomes HT and TH are different. The HT means that the first coin showed heads and the second coin 
showed tails. The TH means that the first coin showed tails and the second coin showed heads. 


* Let A = the event of getting at most one tail. (At most one tail means zero or one tail.) Then A can be written 
as {HH, HT, TH}. The outcome HH shows zero tails. HT and TH each show one tail. 

* Let B = the event of getting all tails. B can be written as {TT}. B is the complement of A, so B = A'. Also, 
P(A) + P(B) = P(A) + P(A') = 1. 

* The probabilities for A and for B are P(A) = 3/4 and P(B) = 1/4. 

* Let C = the event of getting all heads. C = {HH}. Since B = {TT}, P(B ∩ C) = 0. B and C are mutually 
exclusive. (B and C have no members in common because you cannot have all tails and all heads at the same 
time.) 

* Let D = event of getting more than one tail. D = {TT}. P(D) = 1/4 

* Let E = event of getting a head on the first flip. (This implies you can get either a head or tail on the second 
flip.) E = {HT, HH}. P(E) = 2/4 

* Find the probability of getting at least one (one or two) tail in two flips. Let F = event of getting at least one 
tail in two flips. F = {HT, TH, TT}. P(F) = 3/4 


Note: 
Try It 
Exercise: 


Problem: 


Draw two cards from a standard 52-card deck with replacement. Find the probability of getting at least one 
black card. 


Solution: 


The sample space of drawing two cards with replacement from a standard 52-card deck with respect to color 
is {BB, BR, RB, RR}. 


Event A = Getting at least one black card = {BB, BR, RB} 


P(A) = 3/4 = 0.75 


Example: 
Exercise: 


Problem: Flip two fair coins. Find the probabilities of the events. 


a. Let F = the event of getting at most one tail (zero or one tail). 
b. Let G = the event of getting two faces that are the same. 
c. Let H = the event of getting a head on the first flip followed by a head or tail on the second flip. 


d. Are F and G mutually exclusive? 
e. Let J = the event of getting all tails. Are J and H mutually exclusive? 


Solution: 
Look at the sample space in [link]. 


a. Zero (0) or one (1) tails occur when the outcomes HH, TH, HT show up. P(F) = 3/4 

b. Two faces are the same if HH or TT show up. P(G) = 2/4 

c. A head on the first flip followed by a head or tail on the second flip occurs when HH or HT show up. 
P(H) = 2/4 

d. F and G share HH so P(F ∩ G) is not equal to zero (0). F and G are not mutually exclusive. 

e. Getting all tails occurs when tails shows up on both coins (TT). H's outcomes are HH and HT. 
J and H have nothing in common so P(J ∩ H) = 0. J and H are mutually exclusive. 


Note: 
Try It 
Exercise: 


Problem: 


A box has two balls, one white and one red. We select one ball, put it back in the box, and select a second 
ball (sampling with replacement). Find the probability of the following events: 


a. Let F = the event of getting the white ball twice. 

b. Let G = the event of getting two balls of different colors. 
c. Let H = the event of getting white on the first pick. 

d. Are F and G mutually exclusive? 

e. Are G and H mutually exclusive? 


Solution: 


a. P(F) = 1/4 
b. P(G) = 1/2 
c. P(H) = 1/2 
d. Yes 
e. No 


Example: 


Roll one fair, six-sided die. The sample space is {1, 2, 3, 4, 5, 6}. Let event A = a face is odd. Then A = {1, 3, 5}. 
Let event B = a face is even. Then B = {2, 4, 6}. 


* Find the complement of A, A'. The complement of A, A', is B because A and B together make up the sample 
space. P(A) + P(B) = P(A) + P(A') = 1. Also, P(A) = 3/6 and P(B) = 3/6. 
* Let event C = odd faces larger than two. Then C = {3, 5}. Let event D = all even faces smaller than five. 
Then D = {2, 4}. P(C ∩ D) = 0 because you cannot have an odd and even face at the same time. Therefore, 
C and D are mutually exclusive events. 


* Let event E = all faces less than five. E = {1, 2, 3, 4}. 


Exercise: 


Problem: Are C and E mutually exclusive events? (Answer yes or no.) Why or why not? 


Solution: 


No. C = {3, 5} and E = {1, 2, 3, 4}. P(C ∩ E) = 1/6. To be mutually exclusive, P(C ∩ E) must be zero. 


* Find P(C|A). This is a conditional probability. Recall that the event C is {3, 5} and event A is {1, 3, 5}. To 
find P(C|A), find the probability of C using the sample space A. You have reduced the sample space from 
the original sample space {1, 2, 3, 4, 5, 6} to {1, 3, 5}. So, P(C|A) = 2/3 


Note: 
Try It 
Exercise: 


Problem: 


Let event A = learning Spanish. Let event B = learning German. Then A ∩ B = learning Spanish and 
German. Suppose P(A) = 0.4 and P(B) = 0.2. P(A ∩ B) = 0.08. Are events A and B independent? Hint: 
You must show ONE of the following: 


* P(A|B) = P(A) 
* P(B|A) = P(B) 
* P(A ∩ B) = P(A)P(B) 


Solution: 


P(A|B) = P(A ∩ B)/P(B) = 0.08/0.2 = 0.4 = P(A)


The events are independent because P(A|B) = P(A). 


Example: 

Let event G = taking a math class. Let event H = taking a science class. Then, G ∩ H = taking a math class and a
science class. Suppose P(G) = 0.6, P(H) = 0.5, and P(G ∩ H) = 0.3. Are G and H independent?

To show that G and H are independent, you must show ONE of the following:


* P(G|H) = P(G)
* P(H|G) = P(H)
* P(G ∩ H) = P(G)P(H)


Note: 

NOTE 

The choice you make depends on the information you have. You could choose any of the methods here 
because you have the necessary information. 


Exercise: 


Problem: a. Show that P(G|H) = P(G). 


Solution: 
P(G|H) = P(G ∩ H)/P(H) = 0.3/0.5 = 0.6 = P(G)
Exercise: 


Problem: b. Show P(G ∩ H) = P(G)P(H).
Solution: 


P(G)P(H) = (0.6)(0.5) = 0.3 = P(G ∩ H)


Since G and H are independent, knowing that a person is taking a science class does not change the chance that he
or she is taking a math class. If the two events were not independent (that is, if they were dependent), then
knowing that a person is taking a science class would change the chance that he or she is taking math. For practice,
show that P(H|G) = P(H) to confirm that G and H are independent events.
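
The independence checks in this example can also be carried out numerically. The following is a minimal Python sketch rather than part of the original text; the function name is_independent and the tolerance value are illustrative assumptions.

```python
from math import isclose

def is_independent(p_a, p_b, p_a_and_b, tol=1e-9):
    """Check the product condition P(A ∩ B) = P(A)P(B) within a tolerance."""
    return isclose(p_a_and_b, p_a * p_b, abs_tol=tol)

# Values from the math class (G) and science class (H) example.
p_g, p_h, p_g_and_h = 0.6, 0.5, 0.3

# Conditional probabilities from the definition P(A|B) = P(A ∩ B) / P(B).
p_g_given_h = p_g_and_h / p_h
p_h_given_g = p_g_and_h / p_g

print(is_independent(p_g, p_h, p_g_and_h))  # product condition
print(isclose(p_g_given_h, p_g))            # P(G|H) compared with P(G)
print(isclose(p_h_given_g, p_h))            # P(H|G) compared with P(H)
```

When two events are independent, all three conditions hold together, so checking any one of them is enough.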


Note: 
Try It 
Exercise: 


Problem: 


In a bag, there are six red marbles and four green marbles. The red marbles are marked with the numbers 1, 
2, 3, 4, 5, and 6. The green marbles are marked with the numbers 1, 2, 3, and 4. 


• R = a red marble
• G = a green marble
• O = an odd-numbered marble
• The sample space is S = {R1, R2, R3, R4, R5, R6, G1, G2, G3, G4}.


S has ten outcomes. What is P(G ∩ O)?


Solution: 
Event G ∩ O = {G1, G3}


P(G ∩ O) = 2/10 = 0.2


Example: 
Exercise: 


Problem: Let event C = taking an English class. Let event D = taking a speech class. 
Suppose P(C) = 0.75, P(D) = 0.3, P(C|D) = 0.75 and P(C ∩ D) = 0.225.
Justify your answers to the following questions numerically. 


a. Are C and D independent? 
b. Are C and D mutually exclusive? 
c. What is P(D|C)? 


Solution: 


a. Yes, because P(C|D) = P(C).
b. No, because P(C ∩ D) is not equal to zero.
c. P(D|C) = P(C ∩ D)/P(C) = 0.225/0.75 = 0.3


Note: 
Try It 
Exercise: 


Problem: 


A student goes to the library. Let events B = the student checks out a book and D = the student checks out a 
DVD. Suppose that P(B) = 0.40, P(D) = 0.30 and P(B ∩ D) = 0.20.


a. Find P(B|D). 

b. Find P(D|B). 

c. Are B and D independent? 

d. Are B and D mutually exclusive? 


Solution: 
a. P(B|D) = 0.6667 
b. P(D|B) = 0.5 
c. No 
d. No 


Example: 

In a box there are three red cards and five blue cards. The red cards are marked with the numbers 1, 2, and 3, and 
the blue cards are marked with the numbers 1, 2, 3, 4, and 5. The cards are well-shuffled. You reach into the box 
(you cannot see into it) and draw one card. 

Let R = red card is drawn, B = blue card is drawn, E = even-numbered card is drawn. 

The sample space S = {R1, R2, R3, B1, B2, B3, B4, B5}. S has eight outcomes.

• P(R) = 3/8. P(B) = 5/8. P(R ∩ B) = 0. (You cannot draw one card that is both red and blue.)

• P(E) = 3/8. (There are three even-numbered cards: R2, B2, and B4.)

• P(E|B) = 2/5. (There are five blue cards: B1, B2, B3, B4, and B5. Out of the blue cards, there are two even
cards: B2 and B4.)

• P(B|E) = 2/3. (There are three even-numbered cards: R2, B2, and B4. Out of the even-numbered cards, two
are blue: B2 and B4.)

• The events R and B are mutually exclusive because P(R ∩ B) = 0.

• Let G = card with a number greater than 3. G = {B4, B5}. P(G) = 2/8. Let H = blue card numbered between
one and four, inclusive. H = {B1, B2, B3, B4}. P(G|H) = 1/4. (The only card in H that has a number greater
than three is B4.) Since 2/8 = 1/4, P(G) = P(G|H), which means that G and H are independent.
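
The counting in this example can be reproduced by listing the sample space and counting outcomes. The sketch below is an illustration only; the helper names (prob, is_blue, is_even, in_g, in_h) are assumptions introduced here, and a conditional probability is computed by reducing the sample space to the outcomes in the given event.

```python
from fractions import Fraction

# Sample space: red cards 1-3 and blue cards 1-5, all equally likely.
S = [("R", n) for n in range(1, 4)] + [("B", n) for n in range(1, 6)]

def prob(event, given=None):
    """Return P(event | given) by counting outcomes in the reduced sample space."""
    reduced = [card for card in S if given is None or given(card)]
    favorable = [card for card in reduced if event(card)]
    return Fraction(len(favorable), len(reduced))

def is_blue(card):
    return card[0] == "B"

def is_even(card):
    return card[1] % 2 == 0

def in_g(card):
    return card[1] > 3                      # G: number greater than 3

def in_h(card):
    return is_blue(card) and card[1] <= 4   # H: blue card numbered 1 to 4

print(prob(is_even))                 # P(E)
print(prob(is_even, given=is_blue))  # P(E|B)
print(prob(is_blue, given=is_even))  # P(B|E)
print(prob(in_g) == prob(in_g, given=in_h))  # True when G and H are independent
```

Using Fraction keeps the results exact, so the comparison of P(G) with P(G|H) does not depend on rounding.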


Note: 
Try It 
Exercise: 


Problem: In a basketball arena, 


• 70% of the fans are rooting for the home team.
• 25% of the fans are wearing blue.
• 20% of the fans are wearing blue and are rooting for the away team.
• Of the fans rooting for the away team, 67% are wearing blue.


Let A be the event that a fan is rooting for the away team. 
Let B be the event that a fan is wearing blue. 
Are the events of rooting for the away team and wearing blue independent? Are they mutually exclusive? 


Solution: 
P(B|A) = 0.67
P(B) = 0.25


So P(B) does not equal P(B|A), which means that B and A are not independent (wearing blue and rooting
for the away team are not independent). They are also not mutually exclusive, because P(B ∩ A) = 0.20,
not 0.


Example: 

In a particular college class, 60% of the students are female. Fifty percent of all students in the class have long 
hair. Forty-five percent of the students are female and have long hair. Of the female students, 75% have long hair. 
Let F be the event that a student is female. Let L be the event that a student has long hair. One student is picked 
randomly. Are the events of being female and having long hair independent? 


The following probabilities are given in this example: P(F) = 0.60; P(L) = 0.50; P(F ∩ L) = 0.45; P(L|F) = 0.75.


Note: 

NOTE 

The choice you make depends on the information you have. For this example, you could check whether
P(L|F) = P(L) or whether P(F ∩ L) = P(F)P(L). You do not know P(F|L) yet, so you cannot check whether P(F|L) = P(F).


Solution 1 

Check whether P(F ∩ L) = P(F)P(L). We are given that P(F ∩ L) = 0.45, but
P(F)P(L) = (0.60)(0.50) = 0.30. The events of being female and having long hair are not independent
because P(F ∩ L) does not equal P(F)P(L).

Solution 2 

Check whether P(L|F) equals P(L). We are given that P(L|F) = 0.75, but P(L) = 0.50; they are not equal.
The events of being female and having long hair are not independent. 

Interpretation of Results 

The events of being female and having long hair are not independent; knowing that a student is female changes 
the probability that a student has long hair. 


Note: 
Try It 
Exercise: 


Problem: 


Mark is deciding which route to take to work. His choices are I = the Interstate and F = Fifth Street.


• P(I) = 0.44 and P(F) = 0.56
• P(I ∩ F) = 0 because Mark will take only one route to work.


What is P(I ∪ F)?
Solution:
Because P(I ∩ F) = 0,


P(I ∪ F) = P(I) + P(F) - P(I ∩ F) = 0.44 + 0.56 - 0 = 1


Example: 
Exercise: 
Problem: 
a. Toss one fair coin (the coin has two sides, H and T). The outcomes are ________. Count the outcomes.
There are ____ outcomes.
b. Toss one fair, six-sided die (the die has 1, 2, 3, 4, 5 or 6 dots on a side). The outcomes are
________________. Count the outcomes. There are ____ outcomes.
c. Multiply the two numbers of outcomes. The answer is ____.
d. If you flip one fair coin and follow it with the toss of one fair, six-sided die, the answer to c is the
number of outcomes (size of the sample space). What are the outcomes? (Hint: Two of the outcomes are
H1 and T6.)
e. Event A = heads (H) on the coin followed by an even number (2, 4, 6) on the die.
A = {_________________}. Find P(A).
f. Event B = heads on the coin followed by a three on the die. B = {________}. Find P(B).
g. Are A and B mutually exclusive? (Hint: What is P(A ∩ B)? If P(A ∩ B) = 0, then A and B are
mutually exclusive.)
h. Are A and B independent? (Hint: Is P(A ∩ B) = P(A)P(B)? If P(A ∩ B) = P(A)P(B), then A and
B are independent. If not, then they are dependent.)


Solution: 


a. H and T; 2
b. 1, 2, 3, 4, 5, 6; 6
c. 2(6) = 12
d. H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6
e. A = {H2, H4, H6}; P(A) = 3/12
f. B = {H3}; P(B) = 1/12
g. Yes, because P(A ∩ B) = 0
h. P(A ∩ B) = 0; P(A)P(B) = (3/12)(1/12). P(A ∩ B) does not equal P(A)P(B), so A and B are dependent.
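
The coin-and-die sample space and the two checks in parts g and h can be sketched in a few lines of Python. This is an illustrative sketch under the assumption that all outcomes are equally likely; the names sample_space and prob are introduced here and are not from the text.

```python
from fractions import Fraction
from itertools import product

# Sample space for one coin toss followed by one die roll.
sample_space = [f"{coin}{die}" for coin, die in product("HT", range(1, 7))]

A = {"H2", "H4", "H6"}   # heads followed by an even number
B = {"H3"}               # heads followed by a three

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

p_a, p_b, p_a_and_b = prob(A), prob(B), prob(A & B)

print(len(sample_space))           # number of outcomes
print(p_a_and_b == 0)              # mutually exclusive?
print(p_a_and_b == p_a * p_b)      # independent?
```

The set intersection A & B is empty, so the events are mutually exclusive; because P(A)P(B) is not zero, they cannot also be independent.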


Note: 
Try It 
Exercise: 


Problem: 


A box has two balls, one white and one red. We select one ball, put it back in the box, and select a second 
ball (sampling with replacement). Let T be the event of getting the white ball twice, F the event of picking 
the white ball first, S the event of picking the white ball in the second drawing. 


a. Compute P(T). 

b. Compute P(T|F). 

c. Are T and F independent?

d. Are F and S mutually exclusive? 
e. Are F and S independent? 


Solution: 


a. P(T) = 1/4
b. P(T|F) = 1/2
c. No
d. No
e. Yes

Chapter Review


Two events A and B are independent if the knowledge that one occurred does not affect the chance the other 
occurs. If two events are not independent, then we say that they are dependent. 


In sampling with replacement, each member of a population is replaced after it is picked, so that member has the
possibility of being chosen more than once, and the events are considered to be independent. In sampling without
replacement, each member of a population may be chosen only once, and the events are considered not to be
independent. When events do not share outcomes, they are mutually exclusive of each other.


Formula Review 
If A and B are independent, P(A ∩ B) = P(A)P(B), P(A|B) = P(A) and P(B|A) = P(B).


If A and B are mutually exclusive, P(A ∪ B) = P(A) + P(B) and P(A ∩ B) = 0.
Exercise: 


Problem: E and F are mutually exclusive events. P(E) = 0.4; P(F) = 0.5. Find P(E|F).


Exercise: 


Problem: J and K are independent events. P(J|K) = 0.3. Find P(J). 


Solution: 


P(J) =0.3 


Exercise: 


Problem: U and V are mutually exclusive events. P(U) = 0.26; P(V) = 0.37. Find: 


Exercise: 


Problem: Q and R are independent events. P(Q) = 0.4 and P(Q ∩ R) = 0.1. Find P(R).
Solution:

P(Q ∩ R) = P(Q)P(R)

0.1 = (0.4)P(R) 


P(R) = 0.25 


Homework 


Use the following information to answer the next 12 exercises. The graph shown is based on more than 170,000 
interviews done by Gallup that took place from January through December 2012. The sample consists of employed 
Americans 18 years of age or older. The Emotional Health Index Scores are the sample space. We randomly 
sample one Emotional Health Index Score. 


[Figure: bar chart of Emotional Health Index Score by occupation. Occupations shown: Service; Transportation; Manufacturing or production; Sales; Clerical or office; Installation and repair; Construction or mining; Manager, executive, or official; Business owner; Nurse; Professional; Farming, fishing, or forestry; Teacher (K-12); Physician.]


Exercise: 


Problem: Find the probability that an Emotional Health Index Score is 82.7. 


Exercise: 


Problem: Find the probability that an Emotional Health Index Score is 81.0. 


Solution: 
0 


Exercise: 


Problem: Find the probability that an Emotional Health Index Score is more than 81.


Exercise: 


Problem: Find the probability that an Emotional Health Index Score is between 80.5 and 82.


Solution: 
0.3571 


Exercise: 


Problem: If we know an Emotional Health Index Score is 81.5 or more, what is the probability that it is 82.7? 


Exercise: 


Problem: What is the probability that an Emotional Health Index Score is 80.7 or 82.7? 


Solution: 


0.2142 


Exercise: 


Problem: 


What is the probability that an Emotional Health Index Score is less than 80.2, given that it is already less than
81?


Exercise: 
Problem: What occupation has the highest emotional index score? 
Solution: 
Physician (83.7) 


Exercise: 


Problem: What occupation has the lowest emotional index score? 
Exercise: 
Problem: What is the range of the data? 


Solution: 
83.7 — 79.6 = 4.1 


Exercise: 


Problem: Compute the average EHIS. 
Exercise: 


Problem: 


If all occupations are equally likely for a certain individual, what is the probability that he or she will have an 
occupation with lower than average EHIS? 


Solution: 


P(Occupation < 81.3) = 0.5 


Bringing It Together 


Exercise: 


Problem: 


In a previous year, the weights of the members of the San Francisco 49ers and the Dallas Cowboys were
published in the San Jose Mercury News. The factual data are compiled into [link].


Shirt #    ≤ 210    211-250    251-290    290 ≤
1-33       21       5          0          0
34-66      6        18         7          4
66-99      6        12         22         5


For the following, suppose that you randomly select one player from the 49ers or Cowboys. 


If having a shirt number from one to 33 and weighing at most 210 pounds were independent events, then what
should be true about P(Shirt# 1-33 | ≤ 210 pounds)?


Exercise: 


Problem: 


The probability that a male develops some form of cancer in his lifetime is 0.4567. The probability that a 
male has at least one false positive test result (meaning the test comes back for cancer when the man does not 
have it) is 0.51. Some of the following questions do not have enough information for you to answer them. 
Write “not enough information” for those answers. Let C = a man develops cancer in his lifetime and P = man 
has at least one false positive. 


a. P(C) = ______

b. P(P|C) = ______

c. P(P|C’) = ______

d. If a test comes up positive, based upon numerical values, can you assume that man has cancer? Justify 
numerically and explain why or why not. 


Solution: 
a. P(C) = 0.4567 
b. not enough information 


c. not enough information 
d. No, because over half (0.51) of men have at least one false positive test.


Exercise: 


Problem: Given events G and H: P(G) = 0.43; P(H) = 0.26; P(H ∩ G) = 0.14
a. Find P(H ∪ G).
b. Find the probability of the complement of event (H ∩ G).
c. Find the probability of the complement of event (H ∪ G).


Exercise: 


Problem: Given events J and K: P(J) = 0.18; P(K) = 0.37; P(J ∪ K) = 0.45
a. Find P(J ∩ K).
b. Find the probability of the complement of event (J ∩ K).
c. Find the probability of the complement of event (J ∪ K).


Solution:


a. P(J ∪ K) = P(J) + P(K) - P(J ∩ K); 0.45 = 0.18 + 0.37 - P(J ∩ K); solve to find P(J ∩ K) = 0.10
b. P(NOT(J ∩ K)) = 1 - P(J ∩ K) = 1 - 0.10 = 0.90
c. P(NOT(J ∪ K)) = 1 - P(J ∪ K) = 1 - 0.45 = 0.55


Glossary 


Dependent Events 
If two events are NOT independent, then we say that they are dependent. 


Sampling with Replacement 
If each member of a population is replaced after it is picked, then that member has the possibility of being 
chosen more than once. 


Sampling without Replacement 
When sampling is done without replacement, each member of a population may be chosen only once. 


Two Basic Rules of Probability 


When calculating probability, there are two rules to consider when 
determining if two events are independent or dependent and if they are 
mutually exclusive or not. 


The Multiplication Rule 


If A and B are two events defined on a sample space, then:
P(A ∩ B) = P(B)P(A|B). We can think of the intersection symbol as
substituting for the word "and".


This rule may also be written as: P(A|B) = P(A ∩ B)/P(B)


This equation is read as the probability of A given B equals the probability of
A and B divided by the probability of B.


If A and B are independent, then P(A|B) = P(A). In that case,
P(A ∩ B) = P(A|B)P(B) becomes P(A ∩ B) = P(A)P(B).


One easy way to remember the multiplication rule is that the word "and"
means that the event has to satisfy two conditions. For example, the name
drawn from the class roster is to be both a female and a sophomore. It is
harder to satisfy two conditions than only one, and when we multiply two
probabilities (numbers between 0 and 1) the product is never larger than
either factor. This reflects the increasing difficulty of satisfying two
conditions.


The Addition Rule 


If A and B are defined on a sample space, then:

P(A ∪ B) = P(A) + P(B) - P(A ∩ B). We can think of the union
symbol as substituting for the word "or". The reason we subtract the intersection
of A and B is to keep from double counting elements that are in both A and B.


If A and B are mutually exclusive, then P(A ∩ B) = 0. Then
P(A ∪ B) = P(A) + P(B) - P(A ∩ B) becomes


P(A ∪ B) = P(A) + P(B).
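
The two rules can be written as small helper functions. The following Python sketch is an illustration only: the function names p_and and p_or and the input values are hypothetical, chosen simply to show how the formulas are applied.

```python
def p_and(p_b, p_a_given_b):
    """Multiplication rule: P(A ∩ B) = P(B) * P(A|B)."""
    return p_b * p_a_given_b

def p_or(p_a, p_b, p_a_and_b=0.0):
    """Addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

    The default p_a_and_b=0.0 corresponds to mutually exclusive events.
    """
    return p_a + p_b - p_a_and_b

# Hypothetical inputs chosen only for illustration.
p_a, p_b, p_a_given_b = 0.3, 0.5, 0.4
joint = p_and(p_b, p_a_given_b)
print(round(joint, 4))
print(round(p_or(p_a, p_b, joint), 4))
```

Leaving out the third argument of p_or treats the events as mutually exclusive, which matches the simplified form P(A ∪ B) = P(A) + P(B).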


Example: 
Klaus is trying to choose where to go on vacation. His two choices are: A = 
New Zealand and B = Alaska 


• Klaus can only afford one vacation. The probability that he chooses A is
P(A) = 0.6 and the probability that he chooses B is P(B) = 0.35.

• P(A ∩ B) = 0 because Klaus can only afford to take one vacation.

• Therefore, the probability that he chooses either New Zealand or Alaska
is P(A ∪ B) = P(A) + P(B) = 0.6 + 0.35 = 0.95. Note that the
probability that he does not choose to go anywhere on vacation must be
0.05.


Example: 

Carlos plays college soccer. He makes a goal 65% of the time he shoots. 
Carlos is going to attempt two goals in a row in the next game. A = the event 
Carlos is successful on his first attempt. P(A) = 0.65. B = the event Carlos is 
successful on his second attempt. P(B) = 0.65. Carlos tends to shoot in 
streaks. The probability that he makes the second goal given that he made the first
goal is 0.90.


Exercise: 


Problem: a. What is the probability that he makes both goals? 


Solution: 


a. The problem is asking you to find P(A ∩ B) = P(B ∩ A). Since
P(B|A) = 0.90: P(B ∩ A) = P(B|A)P(A) = (0.90)(0.65) = 0.585


Carlos makes the first and second goals with probability 0.585. 


Exercise: 


Problem: 


b. What is the probability that Carlos makes either the first goal or the 
second goal? 


Solution: 
b. The problem is asking you to find P(A ∪ B).
P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = 0.65 + 0.65 - 0.585 = 0.715


Carlos makes either the first goal or the second goal with probability
0.715.


Exercise: 


Problem: c. Are A and B independent? 
Solution: 

c. No, they are not, because P(B ∩ A) = 0.585.
P(B)P(A) = (0.65)(0.65) = 0.423

0.423 ≠ 0.585 = P(B ∩ A)


So, P(B ∩ A) is not equal to P(B)P(A).
Exercise: 


Problem: d. Are A and B mutually exclusive? 


Solution: 


d. No, they are not because P(A ∩ B) = 0.585.


To be mutually exclusive, P(A ∩ B) must equal zero.
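
All four parts of the Carlos example follow from three given numbers. The sketch below, written as an illustration with variable names chosen here, applies the multiplication rule, the addition rule, and the two tests for independence and mutual exclusivity.

```python
from math import isclose

p_a = 0.65           # P(A): Carlos makes the first goal
p_b = 0.65           # P(B): Carlos makes the second goal
p_b_given_a = 0.90   # P(B|A): second goal given the first

p_both = p_b_given_a * p_a        # multiplication rule
p_either = p_a + p_b - p_both     # addition rule

independent = isclose(p_both, p_a * p_b)
mutually_exclusive = isclose(p_both, 0.0, abs_tol=1e-12)

print(round(p_both, 3), round(p_either, 3))
print(independent, mutually_exclusive)
```

The comparisons use isclose instead of == because decimal probabilities are stored as floating-point approximations.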


Note: 
Try It 
Exercise: 


Problem: 


Helen plays basketball. For free throws, she makes the shot 75% of the 
time. Helen must now attempt two free throws. C = the event that Helen 
makes the first shot. P(C) = 0.75. D = the event Helen makes the second 
shot. P(D) = 0.75. The probability that Helen makes the second free 
throw given that she made the first is 0.85. What is the probability that 
Helen makes both free throws? 


Solution: 
P(D|C) = 0.85 


P(C ∩ D) = P(D ∩ C)
P(D ∩ C) = P(D|C)P(C) = (0.85)(0.75) = 0.6375
Helen makes the first and second free throws with probability 0.6375. 


Example: 

A community swim team has 150 members. Seventy-five of the members 
are advanced swimmers. Forty-seven of the members are intermediate 
swimmers. The remainder are novice swimmers. Forty of the advanced 
swimmers practice four times a week. Thirty of the intermediate swimmers 
practice four times a week. Ten of the novice swimmers practice four times a 
week. Suppose one member of the swim team is chosen randomly. 


Exercise: 


Problem: 
a. What is the probability that the member is a novice swimmer? 


Solution: 


a. 28/150


Exercise: 


Problem: 
b. What is the probability that the member practices four times a week? 


Solution: 
b. 80/150
Exercise: 
Problem: 
c. What is the probability that the member is an advanced swimmer and 
practices four times a week? 
Solution: 


c. 40/150


Exercise: 


Problem: 
d. What is the probability that a member is an advanced swimmer and 
an intermediate swimmer? Are being an advanced swimmer and an 


intermediate swimmer mutually exclusive? Why or why not? 


Solution: 


d. P(advanced ∩ intermediate) = 0, so these are mutually exclusive
events. A swimmer cannot be an advanced swimmer and an 
intermediate swimmer at the same time. 


Exercise: 


Problem: 


e. Are being a novice swimmer and practicing four times a week 
independent events? Why or why not? 


Solution: 


e. No, these are not independent events. 

P(novice ∩ practices four times per week) = 0.0667
P(novice)P(practices four times per week) = 0.0996
0.0667 ≠ 0.0996
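
The swim team answers can be checked directly from the counts in the example. The following Python sketch is illustrative; the variable names and the dictionary layout are assumptions made here, not part of the text.

```python
from fractions import Fraction

total = 150
advanced, intermediate = 75, 47
novice = total - advanced - intermediate          # the remainder

# Members who practice four times a week, by level.
practice = {"advanced": 40, "intermediate": 30, "novice": 10}

p_novice = Fraction(novice, total)
p_practice = Fraction(sum(practice.values()), total)
p_novice_and_practice = Fraction(practice["novice"], total)

print(novice, p_novice, p_practice)
print(float(p_novice_and_practice), float(p_novice * p_practice))
print(p_novice_and_practice == p_novice * p_practice)  # independence check
```

Fraction reduces each result to lowest terms, so 28/150 is shown in an equivalent reduced form.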


Note: 
Try It 
Exercise: 


Problem: 


A school has 200 seniors of whom 140 will be going to college next 
year. Forty will be going directly to work. The remainder are taking a 
gap year. Fifty of the seniors going to college play sports. Thirty of the 
seniors going directly to work play sports. Five of the seniors taking a 
gap year play sports. What is the probability that a senior is taking a gap 
year? 


Solution: 


P = (200 - 140 - 40)/200 = 20/200 = 0.1


Example: 

Felicity attends Modesto JC in Modesto, CA. The probability that Felicity 
enrolls in a math class is 0.2 and the probability that she enrolls in a speech 
class is 0.65. The probability that she enrolls in a math class given that she enrolls
in speech class is 0.25.

Let: M = math class, S = speech class, M|S = math given speech 

Exercise: 


Problem: 


a. What is the probability that Felicity enrolls in math and speech?
Find P(M ∩ S) = P(M|S)P(S).

b. What is the probability that Felicity enrolls in math or speech
classes?
Find P(M ∪ S) = P(M) + P(S) - P(M ∩ S).

c. Are M and S independent? Is P(M|S) = P(M)?

d. Are M and S mutually exclusive? Is P(M ∩ S) = 0?


Solution: 


a. 0.1625, b. 0.6875, c. No, d. No 


Note: 
Try It 
Exercise: 


Problem: 
A student goes to the library. Let events B = the student checks out a
book and D = the student checks out a DVD. Suppose that P(B) = 0.40,
P(D) = 0.30 and P(D|B) = 0.5.


a. Find P(B ∩ D).
b. Find P(B ∪ D).


Solution: 


a. P(B ∩ D) = P(D|B)P(B) = (0.5)(0.4) = 0.20.
b. P(B ∪ D) = P(B) + P(D) - P(B ∩ D) = 0.40 + 0.30 - 0.20 = 0.50


Example: 
Studies show that about one woman in seven (approximately 14.3%) who 
live to be 90 will develop breast cancer. Suppose that of those women who 
develop breast cancer, a test is negative 2% of the time. Also suppose that in 
the general population of women, the test for breast cancer is negative about 
85% of the time. Let B = woman develops breast cancer and let N = tests 
negative. Suppose one woman is selected at random. 
Exercise: 

Problem: 


a. What is the probability that the woman develops breast cancer? What 
is the probability that woman tests negative? 


Solution: 

a. P(B) = 0.143; P(N) = 0.85 
Exercise: 

Problem: 


b. Given that the woman has breast cancer, what is the probability that 
she tests negative? 


Solution: 


b. P(N|B) = 0.02 


Exercise: 


Problem: 


c. What is the probability that the woman has breast cancer AND tests 
negative? 


Solution: 

c. P(B ∩ N) = P(B)P(N|B) = (0.143)(0.02) = 0.0029
Exercise: 

Problem: 


d. What is the probability that the woman has breast cancer or tests 
negative? 


Solution: 

d. P(B ∪ N) = P(B) + P(N) - P(B ∩ N) = 0.143 + 0.85 - 0.0029 = 0.9901
Exercise: 

Problem: 

e. Are having breast cancer and testing negative independent events? 

Solution: 

e. No. P(N) = 0.85; P(N|B) = 0.02. So, P(N|B) does not equal P(N). 
Exercise: 

Problem: 

f. Are having breast cancer and testing negative mutually exclusive? 


Solution: 


f. No. P(B ∩ N) = 0.0029. For B and N to be mutually exclusive, P(B ∩ N)
must be zero.
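
The calculations in parts c through e of this example can be sketched as follows. This is an illustrative Python sketch; the variable names are assumptions introduced here, and the printed values are rounded to four decimal places as in the solutions.

```python
from math import isclose

p_b = 0.143          # P(B): woman develops breast cancer
p_n = 0.85           # P(N): test is negative
p_n_given_b = 0.02   # P(N|B): negative test given breast cancer

p_b_and_n = p_b * p_n_given_b       # multiplication rule
p_b_or_n = p_b + p_n - p_b_and_n    # addition rule

print(round(p_b_and_n, 4))
print(round(p_b_or_n, 4))
print(isclose(p_n_given_b, p_n))    # independence check: is P(N|B) equal to P(N)?
```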


Note: 
Try It 
Exercise: 


Problem: 

A school has 200 seniors of whom 140 will be going to college next 
year. Forty will be going directly to work. The remainder are taking a 
gap year. Fifty of the seniors going to college play sports. Thirty of the 
seniors going directly to work play sports. Five of the seniors taking a
gap year play sports. What is the probability that a senior is going to
college and plays sports?


Solution: 
Let A = student is a senior going to college. 


Let B = student plays sports. 


P(A) = 140/200

P(B|A) = 50/140


P(A ∩ B) = P(B|A)P(A)


P(A ∩ B) = (50/140)(140/200) = 1/4


Example: 
Exercise: 


Problem: Refer to the information in [link]. P = tests positive. 


a. Given that a woman develops breast cancer, what is the probability
that she tests positive? Find P(P|B) = 1 - P(N|B).

b. What is the probability that a woman develops breast cancer and
tests positive? Find P(B ∩ P) = P(P|B)P(B).

c. What is the probability that a woman does not develop breast
cancer? Find P(B’) = 1 - P(B).

d. What is the probability that a woman tests positive for breast
cancer? Find P(P) = 1 - P(N).


Solution: 


a. 0.98; b. 0.1401; c. 0.857; d. 0.15


Note: 
Try It 
Exercise: 


Problem: 


A student goes to the library. Let events B = the student checks out a 
book and D = the student checks out a DVD. Suppose that P(B) = 0.40, 
P(D) = 0.30 and P(D|B) = 0.5. 


a. Find P(B’).
b. Find P(D ∩ B).
c. Find P(B|D).
d. Find P(D ∩ B’).
e. Find P(D|B’).
Solution: 
a. P(B’) = 0.60 


b. P(D ∩ B) = P(D|B)P(B) = (0.5)(0.40) = 0.20
c. P(B|D) = P(B ∩ D)/P(D) = 0.20/0.30 = 0.6667
d. P(D ∩ B’) = P(D) - P(D ∩ B) = 0.30 - 0.20 = 0.10
e. P(D|B’) = P(D ∩ B’)/P(B’) = 0.10/0.60 = 0.1667
Chapter Review


The multiplication rule and the addition rule are used for computing the 
probability of A and B, as well as the probability of A or B for two given 
events A, B defined on the sample space. In sampling with replacement each 
member of a population is replaced after it is picked, so that member has the 
possibility of being chosen more than once, and the events are considered to 
be independent. In sampling without replacement, each member of a 
population may be chosen only once, and the events are considered to be not 
independent. The events A and B are mutually exclusive events when they do 
not have any outcomes in common. 


Formula Review 
The multiplication rule: P(A ∩ B) = P(A|B)P(B)
The addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)


Use the following information to answer the next ten exercises. Forty-eight
percent of all California registered voters prefer life in prison without
parole over the death penalty for a person convicted of first degree murder. 
Among Latino California registered voters, 55% prefer life in prison without 
parole over the death penalty for a person convicted of first degree murder. 
37.6% of all Californians are Latino. 


In this problem, let: 


• C = Californians (registered voters) preferring life in prison without
parole over the death penalty for a person convicted of first degree
murder.

• L = Latino Californians


Suppose that one Californian is randomly selected. 
Exercise: 


Problem: Find P(C). 
Exercise: 
Problem: Find P(L). 


Solution: 
0.376 


Exercise: 


Problem: Find P(C|L). 


Exercise: 


Problem: In words, what is C|L? 


Solution: 


C|L means, given the person chosen is a Latino Californian, the person is 
a registered voter who prefers life in prison without parole for a person 
convicted of first degree murder. 


Exercise: 


Problem: Find P(L ∩ C).


Exercise: 


Problem: In words, what is L ∩ C?


Solution: 


L ∩ C is the event that the person chosen is a Latino California
registered voter who prefers life without parole over the death penalty
for a person convicted of first degree murder.


Exercise: 


Problem: Are L and C independent events? Show why or why not. 
Exercise: 


Problem: Find P(L U C). 


Solution: 
0.6492 


Exercise: 


Problem: In words, what is L U C? 
Exercise: 


Problem: 
Are L and C mutually exclusive events? Show why or why not. 
Solution: 


No, because P(L ∩ C) does not equal 0.


Homework 


Exercise: 


Problem: 


On February 28, 2013, a Field Poll Survey reported that 61% of 
California registered voters approved of allowing two people of the same 
gender to marry and have regular marriage laws apply to them. Among 
18 to 39 year olds (California registered voters), the approval rating was 
78%. Six in ten California registered voters said that the upcoming 
Supreme Court’s ruling about the constitutionality of California’s 
Proposition 8 was either very or somewhat important to them. Out of 
those CA registered voters who support same-sex marriage, 75% say the 
ruling is important to them. 


In this problem, let: 




C = California registered voters who support same-sex marriage. 

B = California registered voters who say the Supreme Court’s ruling 
about the constitutionality of California’s Proposition 8 is very or 
somewhat important to them 

A = California registered voters who are 18 to 39 years old. 


a. Find P(C).

b. Find P(B).

c. Find P(C|A).

d. Find P(B|C).

e. In words, what is C|A?

f. In words, what is B|C?

g. Find P(C ∩ B).

h. In words, what is C ∩ B?

i. Find P(C ∪ B).

j. Are C and B mutually exclusive events? Show why or why not.


Exercise: 


Problem: 


After Rob Ford, the mayor of Toronto, announced his plans to cut budget 
costs in late 2011, the Forum Research polled 1,046 people to measure 
the mayor’s popularity. Everyone polled expressed either approval or 
disapproval. These are the results their poll produced: 


• In early 2011, 60 percent of the population approved of Mayor
Ford’s actions in office.

• In mid-2011, 57 percent of the population approved of his actions.

• In late 2011, the percentage of popular approval was measured at 42
percent.


a. What is the sample size for this study? 

b. What proportion in the poll disapproved of Mayor Ford, according 
to the results from late 2011? 

c. How many people polled responded that they approved of Mayor 
Ford in late 2011? 

d. What is the probability that a person supported Mayor Ford, based 
on the data collected in mid-2011? 

e. What is the probability that a person supported Mayor Ford, based 
on the data collected in early 2011? 


Solution: 


a. The Forum Research surveyed 1,046 Torontonians. 
b. 58% 

c. 42% of 1,046 = 439 (rounding to the nearest integer) 
d. 0.57

e. 0.60. 


Use the following information to answer the next three exercises. The casino 
game, roulette, allows the gambler to bet on the probability of a ball, which 
spins in the roulette wheel, landing on a particular color, number, or range of 
numbers. The table used to place bets consists of 38 numbers, and each
number is assigned to a color and a range. 
(credit: film8ker/wikibooks)


Exercise: 


Problem: 


a. List the sample space of the 38 possible outcomes in roulette. 

b. You bet on red. Find P(red). 

c. You bet on -1st 12- (1st Dozen). Find P(-1st 12-). 

d. You bet on an even number. Find P(even number). 

e. Is getting an odd number the complement of getting an even 
number? Why? 

f. Find two mutually exclusive events. 

g. Are the events Even and 1st Dozen independent? 


Exercise: 


Problem: 
Compute the probability of winning the following types of bets: 


a. Betting on two lines that touch each other on the table as in 1-2-3- 
4-5-6 

b. Betting on three numbers in a line, as in 1-2-3 

c. Betting on one number 


d. Betting on four numbers that touch each other to form a square, as 
in 10-11-13-14 

e. Betting on two numbers that touch each other on the table, as in 10- 
11 or 10-13 

f. Betting on 0-00-1-2-3 

g. Betting on 0-1-2; or 0-00-2; or 00-2-3 


Solution: 


a. P(Betting on two lines that touch each other on the table) = 6/38

b. P(Betting on three numbers in a line) = 3/38

c. P(Betting on one number) = 1/38

d. P(Betting on four numbers that touch each other to form a square) =
4/38

e. P(Betting on two numbers that touch each other on the table) = 2/38

f. P(Betting on 0-00-1-2-3) = 5/38

g. P(Betting on 0-1-2; or 0-00-2; or 00-2-3) = 3/38


Exercise: 


Problem: 
Compute the probability of winning the following types of bets: 


a. Betting on a color 

b. Betting on one of the dozen groups 

c. Betting on the range of numbers from 1 to 18 

d. Betting on the range of numbers 19-36 

e. Betting on one of the columns 

f. Betting on an even or odd number (excluding zero) 


Exercise: 


Problem: 


Suppose that you have eight cards. Five are green and three are yellow. 
The five green cards are numbered 1, 2, 3, 4, and 5. The three yellow 
cards are numbered 1, 2, and 3. The cards are well shuffled. You 
randomly draw one card. 


• G = card drawn is green
• E = card drawn is even-numbered


a. List the sample space.

b. P(G) =

c. P(G|E) =

d. P(G ∩ E) =

e. P(G ∪ E) =

f. Are G and E mutually exclusive? Justify your answer 
numerically. 


Solution: 


a. {G1, G2, G3, G4, G5, Y1, Y2, Y3}
b. 5/8
c. 2/3
d. 2/8
e. 6/8
f. No, because P(G ∩ E) does not equal 0.


Exercise: 


Problem: Roll two fair dice separately. Each die has six faces. 


a. List the sample space. 

b. Let A be the event that either a three or four is rolled first, followed 
by an even number. Find P(A). 

c. Let B be the event that the sum of the two rolls is at most seven. 


Find P(B). 


d. In words, explain what “P(A|B)” represents. Find P(A|B). 

e. Are A and B mutually exclusive events? Explain your answer in one 
to three complete sentences, including numerical justification. 

f. Are A and B independent events? Explain your answer in one to 
three complete sentences, including numerical justification. 


Exercise: 


Problem: 


A special deck of cards has ten cards. Four are green, three are blue, and 
three are red. When a card is picked, its color is recorded. An
experiment consists of first picking a card and then tossing a coin. 


a. List the sample space. 

b. Let A be the event that a blue card is picked first, followed by 
landing a head on the coin toss. Find P(A). 

c. Let B be the event that a red or green is picked, followed by landing 
a head on the coin toss. Are the events A and B mutually exclusive? 
Explain your answer in one to three complete sentences, including 
numerical justification. 

d. Let C be the event that a red or blue is picked, followed by landing 
a head on the coin toss. Are the events A and C mutually exclusive? 
Explain your answer in one to three complete sentences, including 
numerical justification. 


Solution: 


Note: 
NOTE 
The coin toss is independent of the card picked first. 


a. {GH, GT, BH, BT, RH, RT}


b. P(A) = P(blue)P(head) = (3/10)(1/2) = 3/20


c. Yes, A and B are mutually exclusive because they cannot happen at
the same time; you cannot pick a card that is both blue and also (red
or green). P(A ∩ B) = 0

d. No, A and C are not mutually exclusive because they can occur at
the same time. In fact, C includes all of the outcomes of A; if the
card chosen is blue it is also (red or blue). P(A ∩ C) = P(A) = 3/20


Exercise: 


Problem: 
An experiment consists of first rolling a die and then tossing a coin. 


a. List the sample space. 

b. Let A be the event that either a three or a four is rolled first, 
followed by landing a head on the coin toss. Find P(A). 

c. Let B be the event that the first and second tosses land on heads. 
Are the events A and B mutually exclusive? Explain your answer in 
one to three complete sentences, including numerical justification. 


Exercise: 


Problem: 


An experiment consists of tossing a nickel, a dime, and a quarter. Of 
interest is the side the coin lands on. 


a. List the sample space. 

b. Let A be the event that there are at least two tails. Find P(A). 

c. Let B be the event that the first and second tosses land on heads. 
Are the events A and B mutually exclusive? Explain your answer in 
one to three complete sentences, including justification. 


Solution: 
a. S = {(HHH), (HHT), (HTH), (HTT), (THH), (THT), (TTH), (TTT)} 
b. 4/8


c. Yes, because if B has occurred (the first two tosses are heads), it is
impossible to obtain two tails. In other words, P(A ∩ B) = 0.


Exercise: 


Consider the following scenario: 
Let P(C) = 0.4. 
Let P(D) = 0.5. 

Problem: Let P(C|D) = 0.6. 


a. Find P(C ∩ D).

b. Are C and D mutually exclusive? Why or why not? 
c. Are C and D independent events? Why or why not? 
d. Find P(C U D). 

e. Find P(D|C). 


Exercise: 


Problem: Y and Z are independent events. 


a. Rewrite the basic Addition Rule P(Y U Z) = P(Y) + P(Z) - P(Y ∩ Z)
using the information that Y and Z are independent events. 

b. Use the rewritten rule to find P(Z) if P(Y U Z) = 0.71 and P(Y) = 
0.42. 


Solution: 
a. If Y and Z are independent, then P(Y ∩ Z) = P(Y)P(Z), so
P(Y U Z) = P(Y) + P(Z) - P(Y)P(Z).
b. 0.5
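
The algebra behind part b can be checked numerically. The short Python sketch below solves P(Y U Z) = P(Y) + P(Z) - P(Y)P(Z) for P(Z); the function name `solve_p_z` is our own label for illustration and is not part of the text.

```python
def solve_p_z(p_union, p_y):
    """Solve p_union = p_y + p_z - p_y * p_z for p_z (Y and Z independent)."""
    # Rearranged: p_union - p_y = p_z * (1 - p_y)
    return (p_union - p_y) / (1 - p_y)


p_z = solve_p_z(0.71, 0.42)
print(round(p_z, 4))
```

The rearrangement only works when P(Y) is less than 1; otherwise the division is undefined.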


Exercise: 


Problem: G and H are mutually exclusive events. P(G) = 0.5 P(H) = 0.3 


a. Explain why the following statement MUST be false: P(H|G) = 0.4. 


b. Find P(H U G). 
c. Are G and H independent or dependent events? Explain in a 
complete sentence. 


Exercise: 
Problem: 
Approximately 281,000,000 people over age five live in the United 
States. Of these people, 55,000,000 speak a language other than English 


at home. Of those who speak another language at home, 62.3% speak 
Spanish. 


Let: E = speaks English at home; E' = speaks another language at home; 
S = speaks Spanish; 


Finish each probability statement by matching the correct answer. 


Probability Statements    Answers
a. P(E') =                i. 0.8043
b. P(E) =                 ii. 0.623
c. P(S ∩ E') =            iii. 0.1957
d. P(S|E') =              iv. 0.1219

Solution:
a. iii   b. i   c. iv   d. ii


Exercise: 


Problem: 


In 1994, the U.S. government held a lottery to issue 55,000 Green Cards
(permits for non-citizens to work legally in the U.S.). Renate Deutsch, 
from Germany, was one of approximately 6.5 million people who 
entered this lottery. Let G = won green card. 


a. What was Renate’s chance of winning a Green Card? Write your 
answer as a probability statement. 

b. In the summer of 1994, Renate received a letter stating she was one 
of 110,000 finalists chosen. Once the finalists were chosen, 
assuming that each finalist had an equal chance to win, what was 
Renate’s chance of winning a Green Card? Write your answer as a 
conditional probability statement. Let F = was a finalist. 

c. Are G and F independent or dependent events? Justify your answer 
numerically and also explain why. 

d. Are G and F mutually exclusive events? Justify your answer 
numerically and explain why. 


Exercise: 


Problem: 


Three professors at George Washington University did an experiment to 
determine if economists are more selfish than other people. They 
dropped 64 stamped, addressed envelopes with $10 cash in different 
classrooms on the George Washington campus. 44% were returned 
overall. From the economics classes 56% of the envelopes were 
returned. From the business, psychology, and history classes 31% were 
returned. 


Let: R = money returned; E = economics classes; O = other classes 


a. Write a probability statement for the overall percent of money 
returned. 

b. Write a probability statement for the percent of money returned out 
of the economics classes. 


c. Write a probability statement for the percent of money returned out 
of the other classes. 

d. Is money being returned independent of the class? Justify your 
answer numerically and explain it. 

e. Based upon this study, do you think that economists are more 
selfish than other people? Explain why or why not. Include 
numbers to justify your answer. 


Solution: 
a. P(R) = 0.44 
b. P(R|E) = 0.56 
c. P(R|O) = 0.31 


d. No, whether the money is returned is not independent of which
class the money was placed in. There are several ways to justify
this mathematically, but one is that the money placed in economics
classes is not returned at the same overall rate; P(R|E) ≠ P(R).

e. No, this study definitely does not support that notion; in fact, it
suggests the opposite. The money placed in the economics
classrooms was returned at a higher rate than the money placed in all
classes collectively; P(R|E) > P(R).


Exercise: 
Problem: 
The following table of data obtained from www.baseball-almanac.com
shows hit information for four players. Suppose that one hit from the
table is randomly selected.


Name               Single   Double   Triple   Home run   Total hits
Babe Ruth           1,517      506      136        714        2,873
Jackie Robinson     1,054      273       54        137        1,518
Ty Cobb             3,603      174      295        114        4,189
Hank Aaron          2,294      624       98        755        3,771
Total               8,471    1,577      583      1,720       12,351


Are "the hit being made by Hank Aaron" and "the hit being a double"
independent events?


a. Yes, because P(hit by Hank Aaron|hit is a double) = P(hit by Hank
Aaron)

b. No, because P(hit by Hank Aaron|hit is a double) ≠ P(hit is a
double)

c. No, because P(hit is by Hank Aaron|hit is a double) ≠ P(hit by
Hank Aaron)

d. Yes, because P(hit is by Hank Aaron|hit is a double) = P(hit is a
double)


Exercise: 


Problem: 


United Blood Services is a blood bank that serves more than 500 
hospitals in 18 states. According to their website, a person with type O 
blood and a negative Rh factor (Rh-) can donate blood to any person 
with any bloodtype. Their data show that 43% of people have type O 
blood and 15% of people have Rh- factor; 52% of people have type O or 
Rh- factor. 


a. Find the probability that a person has both type O blood and the 
Rh- factor. 

b. Find the probability that a person does NOT have both type O 
blood and the Rh- factor. 


Solution: 
a. P(type O U Rh-) = P(type O) + P(Rh-) - P(type O ∩ Rh-)

0.52 = 0.43 + 0.15 - P(type O ∩ Rh-); solve to find
P(type O ∩ Rh-) = 0.06

6% of people have type O, Rh- blood
b. P(NOT(type O ∩ Rh-)) = 1 - P(type O ∩ Rh-) = 1 - 0.06 = 0.94
94% of people do not have type O, Rh- blood
Exercise: 
Problem: 


At a college, 72% of courses have final exams and 46% of courses
require research papers. Suppose that 32% of courses have a research 
paper and a final exam. Let F be the event that a course has a final exam. 
Let R be the event that a course requires a research paper. 


a. Find the probability that a course has a final exam or a research 
project. 


b. Find the probability that a course has NEITHER of these two 
requirements. 


Exercise: 


Problem: 


In a box of assorted cookies, 36% contain chocolate and 12% contain 
nuts. Of those, 8% contain both chocolate and nuts. Sean is allergic to 
both chocolate and nuts. 


a. Find the probability that a cookie contains chocolate or nuts (he 
can't eat it). 

b. Find the probability that a cookie does not contain chocolate or nuts 
(he can eat it). 


Solution: 


Let C = the event that the cookie contains chocolate. Let N = the
event that the cookie contains nuts.

a. P(C U N) = P(C) + P(N) - P(C ∩ N) = 0.36 + 0.12 - 0.08 = 0.40

b. P(NEITHER chocolate NOR nuts) = 1 - P(C U N) = 1 - 0.40 = 0.60


Exercise: 


Problem: 


A college finds that 10% of students have taken a distance learning class 
and that 40% of students are part time students. Of the part time 
students, 20% have taken a distance learning class. Let D = event that a 
student takes a distance learning class and E = event that a student is a 
part time student 


a. Find P(D | E). 

b. Find P(E|D). 

c. Find P(D U E). 

d. Using an appropriate test, show whether D and E are independent. 


e. Using an appropriate test, show whether D and E are mutually 
exclusive. 


Glossary 


Independent Events 
The occurrence of one event has no effect on the probability of the 
occurrence of another event. Events A and B are independent if one of 
the following is true: 


1. P(A|B) = P(A)
2. P(B|A) = P(B)
3. P(A ∩ B) = P(A)P(B)


Mutually Exclusive 
Two events are mutually exclusive if the probability that they both 
happen at the same time is zero. If events A and B are mutually 
exclusive, then P(A ∩ B) = 0.
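
Both glossary tests can be applied by listing a small sample space and counting. The Python sketch below does this for the eight-card exercise earlier in this section (five green cards numbered 1 to 5, three yellow cards numbered 1 to 3); the helper `prob` and the variable names are our own, chosen only for illustration.

```python
from fractions import Fraction

# Sample space from the eight-card exercise: five green cards, three yellow.
cards = ["G1", "G2", "G3", "G4", "G5", "Y1", "Y2", "Y3"]


def prob(event):
    """Probability of an event given as a set of equally likely outcomes."""
    return Fraction(len(event), len(cards))


G = {c for c in cards if c.startswith("G")}    # card drawn is green
E = {c for c in cards if int(c[1]) % 2 == 0}   # card drawn is even-numbered

p_g_and_e = prob(G & E)                        # P(G and E)
mutually_exclusive = (p_g_and_e == 0)
independent = (p_g_and_e == prob(G) * prob(E))

print(p_g_and_e, mutually_exclusive, independent)
```

Using exact fractions avoids rounding problems when two probabilities are compared for equality.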


Contingency Tables and Probability Trees 


Contingency Tables 


A contingency table provides a way of portraying data that can facilitate 
calculating probabilities. The table helps in determining conditional 
probabilities quite easily. The table displays sample values in relation to 
two different variables that may be dependent or contingent on one another. 
Later on, we will use contingency tables again, but in another manner. 


Example: 
Suppose a study of speeding violations and drivers who use cell phones 
produced the following fictional data: 


                                      Speeding violation   No speeding violation
                                      in the last year     in the last year        Total
Uses cell phone while driving                 25                   280              305
Does not use cell phone while driving         45                   405              450
Total                                         70                   685              755


The total number of people in the sample is 755. The row totals are 305 
and 450. The column totals are 70 and 685. Notice that 305 + 450 = 755 


and 70 + 685 = 755. 
Calculate the following probabilities using the table. 


Exercise: 


Problem: a. Find P(Driver is a cell phone user). 


Solution: 
a. (number of cell phone users)/(total number in study) = 305/755
Exercise: 


Problem: b. Find P(Driver had no violation in the last year). 


Solution: 


b. (number that had no violation)/(total number in study) = 685/755


Exercise: 


Problem: 


c. Find P(Driver had no violation in the last year ∩ was a cell phone
user). 


Solution: 


c. 280/755


Exercise: 
Problem: 


d. Find P(Driver is a cell phone user U driver had no violation in the 
last year). 


Solution: 


d. (305/755 + 685/755) - 280/755 = 710/755
Exercise: 
Problem: 


e. Find P(Driver is a cell phone user | driver had a violation in the last
year). 


Solution: 


e. 25/70 (The sample space is reduced to the number of drivers who had
a violation.)


Exercise: 
Problem: 
f. Find P(Driver had no violation last year | driver was not a cell 
phone user) 
Solution: 
f. 405/450 (The sample space is reduced to the number of drivers who
were not cell phone users.)
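
The six calculations in this example all follow the same pattern: count the cells that match, then divide by the size of the relevant sample space. A minimal Python sketch of that pattern is shown below, using the fictional counts from the table; the dictionary layout and the helper `count` are our own illustrative choices.

```python
from fractions import Fraction

# Counts from the fictional speeding / cell phone table.
table = {
    ("cell", "violation"): 25,
    ("cell", "no violation"): 280,
    ("no cell", "violation"): 45,
    ("no cell", "no violation"): 405,
}
total = sum(table.values())  # 755 drivers in the study


def count(row=None, col=None):
    """Sum the cells that match the given row and/or column label."""
    return sum(n for (r, c), n in table.items()
               if (row is None or r == row) and (col is None or c == col))


p_cell = Fraction(count(row="cell"), total)                    # part a
p_no_violation = Fraction(count(col="no violation"), total)    # part b
p_both = Fraction(count("cell", "no violation"), total)        # part c
p_either = p_cell + p_no_violation - p_both                    # part d
# Conditional probabilities shrink the sample space to one column or row.
p_cell_given_violation = Fraction(count("cell", "violation"),
                                  count(col="violation"))      # part e
p_no_violation_given_no_cell = Fraction(count("no cell", "no violation"),
                                        count(row="no cell"))  # part f

print(p_either, p_cell_given_violation, p_no_violation_given_no_cell)
```

Note that `Fraction` reduces its results, so 710/755 prints as 142/151; the two are equal.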


Note: 
Try it 
Exercise: 


Problem: 


[link] shows the number of athletes who stretch before exercising and 
how many had injuries within the past year. 


                    Injury in last year   No injury in last year   Total
Stretches                    55                    295              350
Does not stretch            231                    219              450
Total                       286                    514              800


a. What is P(athlete stretches before exercising)? 
b. What is P(athlete stretches before exercising|no injury in the last 


year)? 
Solution: 
a. P(athlete stretches before exercising) = 350/800 = 0.4375
b. P(athlete stretches before exercising|no injury in the last year) =
295/514 = 0.5739


Example: 


[link] shows a random sample of 100 hikers and the areas of hiking they
prefer.


Sex      The coastline   Near lakes and streams   On mountain peaks   Total
Female        18                  16                    ___              45
Male         ___                 ___                     14              55
Total        ___                  41                    ___             ___

Hiking Area Preference


Exercise:


Problem: a. Complete the table.


Solution:


a.

Sex      The coastline   Near lakes and streams   On mountain peaks   Total
Female        18                  16                     11              45
Male          16                  25                     14              55
Total         34                  41                     25             100

Hiking Area Preference


Exercise:


Problem:


b. Are the events "being female" and "preferring the coastline" 


independent events? 


Let F = being female and let C = preferring the coastline. 


1. Find P(F ∩ C).
2. Find P(F)P(C) 


Are these two numbers the same? If they are, then F and C are 
independent. If they are not, then F and C are not independent. 


Solution: 
b.

1. P(F ∩ C) = 18/100 = 0.18

2. P(F)P(C) = (45/100)(34/100) = (0.45)(0.34) = 0.153

P(F ∩ C) ≠ P(F)P(C), so the events F and C are not independent.


Exercise: 


Problem: 
c. Find the probability that a person is male given that the person 
prefers hiking near lakes and streams. Let M = being male, and let L = 


prefers hiking near lakes and streams. 


1. What word tells you this is a conditional? 


2. Fill in the blanks and calculate the probability: P(___|__) = 
3. Is the sample space for this problem all 100 hikers? If not, what 
is it? 
Solution: 


c.

1. The word ‘given’ tells you that this is a conditional.

2. P(M|L) = 25/41

3. No, the sample space for this problem is the 41 hikers who prefer 
lakes and streams. 


Exercise: 


Problem: 


d. Find the probability that a person is female or prefers hiking on 
mountain peaks. Let F = being female, and let P = prefers mountain 
peaks. 


1. Find P(F). 
2. Find P(P). 
3. Find P(F ∩ P).
4. Find P(F U P). 


Solution: 


d.
1. P(F) = 45/100
2. P(P) = 25/100
3. P(F ∩ P) = 11/100
4. P(F U P) = 45/100 + 25/100 - 11/100 = 59/100


Note: 
Try It 
Exercise: 


Problem: 


[link] shows a random sample of 200 cyclists and the routes they
prefer. Let M = males and H = hilly path.


Gender   Lake path   Hilly path   Wooded path   Total
Female      45           38            27         110
Male        26           52            12          90
Total       71           90            39         200


a. Out of the males, what is the probability that the cyclist prefers a 
hilly path? 

b. Are the events “being male” and “preferring the hilly path” 
independent events? 


Solution: 


a. P(H|M) = 52/90 = 0.5778


b. For M and H to be independent, show P(H|M) = P(H)
P(H|M) = 0.5778, P(H) = 90/200 = 0.45


P(H|M) does not equal P(H) so M and H are NOT independent.


Example: 

Muddy Mouse lives in a cage with three doors. If Muddy goes out the first
door, the probability that he gets caught by Alissa the cat is 1/5 and the
probability he is not caught is 4/5. If he goes out the second door, the
probability he gets caught by Alissa is 1/4 and the probability he is not
caught is 3/4. The probability that Alissa catches Muddy coming out of the
third door is 1/2 and the probability she does not catch Muddy is 1/2. It is
equally likely that Muddy will choose any of the three doors, so the
probability of choosing each door is 1/3.


Caught or not   Door one   Door two   Door three   Total
Caught            1/15       1/12        1/6        ___
Not caught        4/15       3/12        1/6        ___
Total             ___        ___         ___         1

Door Choice


• The first entry 1/15 = (1/5)(1/3) is P(Door One ∩ Caught)
• The entry 4/15 = (4/5)(1/3) is P(Door One ∩ Not Caught)

Verify the remaining entries. 


Exercise: 


Problem: 


a. Complete the probability contingency table. Calculate the entries 
for the totals. Verify that the lower-right corner entry is 1. 


Solution: 
a.
Caught or not   Door one   Door two   Door three   Total
Caught            1/15       1/12        1/6       19/60
Not caught        4/15       3/12        1/6       41/60
Total             5/15       4/12        2/6         1

Door Choice


Exercise: 
Problem: 
b. What is the probability that Alissa does not catch Muddy? 
Solution: 


b. 41/60


Exercise: 
Problem: 


c. What is the probability that Muddy chooses Door One U Door Two 
given that Muddy is caught by Alissa? 


Solution: 


c. 9/19
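
The whole probability table for this example can be rebuilt from the door probability and the conditional catch probabilities. The Python sketch below assumes catch probabilities of 1/5, 1/4, and 1/2 for doors one, two, and three, which are the values consistent with the table totals 19/60 and 41/60; the variable names are our own.

```python
from fractions import Fraction as F

p_door = F(1, 3)  # each door is equally likely
p_caught_given_door = {"one": F(1, 5), "two": F(1, 4), "three": F(1, 2)}

# Joint probabilities: P(door and caught) = P(door) * P(caught | door).
caught = {d: p_door * p for d, p in p_caught_given_door.items()}
not_caught = {d: p_door * (1 - p) for d, p in p_caught_given_door.items()}

p_caught = sum(caught.values())          # row total for "Caught"
p_not_caught = sum(not_caught.values())  # row total for "Not caught"

# Conditional probability: restrict attention to the "Caught" row.
p_one_or_two_given_caught = (caught["one"] + caught["two"]) / p_caught

print(p_caught, p_not_caught, p_one_or_two_given_caught)
```

The two row totals add to 1, which is the check asked for in part a.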


Example: 


[link] contains the number of crimes per 100,000 inhabitants from 2008 to 
2011 in the U.S. 


Year    Robbery   Burglary   Rape   Vehicle   Total
2008     145.7     732.1     29.7    314.7
2009     133.1     717.7     29.1    259.2
2010     119.3     701       27.7    239.1
2011     113.7     702.2     26.8    229.6
Total

United States Crime Index Rates Per 100,000 Inhabitants 2008-2011


Exercise: 


Problem: TOTAL each column and each row. Total data = 4,520.7 


a. Find P(2009 ∩ Robbery).
b. Find P(2010 ∩ Burglary).
c. Find P(2010 U Burglary).


d. Find P(2011|Rape). 


e. Find P(Vehicle|2008). 


Solution: 


a. 0.0294, b. 0.1551, c. 0.7165, d. 0.2365, e. 0.2575


Note: 
Try It 


Exercise: 


Problem: 


[link] relates the weights and heights of a group of individuals 
participating in an observational study. 


Weight/height   Tall   Medium   Short   Totals
Obese            18      28       14
Normal           20      51       28
Underweight      12      25        9
Totals


a. Find the total for each row and column 

b. Find the probability that a randomly chosen individual from this 
group is Tall. 

c. Find the probability that a randomly chosen individual from this 
group is Obese and Tall. 

d. Find the probability that a randomly chosen individual from this 
group is Tall given that the individual is Obese.

e. Find the probability that a randomly chosen individual from this 
group is Obese given that the individual is Tall. 

f. Find the probability a randomly chosen individual from this 
group is Tall and Underweight. 

g. Are the events Obese and Tall independent? 


Solution: 


Weight/height   Tall   Medium   Short   Totals
Obese            18      28       14       60
Normal           20      51       28       99
Underweight      12      25        9       46
Totals           50     104       51      205


a. Row totals: 60, 99, 46. Column totals: 50, 104, 51.
b. P(Tall) = 50/205 = 0.244
c. P(Obese ∩ Tall) = 18/205 = 0.088
d. P(Tall|Obese) = 18/60 = 0.3
e. P(Obese|Tall) = 18/50 = 0.36
f. P(Tall ∩ Underweight) = 12/205 = 0.0585
g. No. P(Tall) does not equal P(Tall|Obese).


Tree Diagrams 


Sometimes, when the probability problems are complex, it can be helpful to 
graph the situation. Tree diagrams can be used to visualize and solve 
conditional probabilities. 


Tree Diagrams 


A tree diagram is a special type of graph used to determine the outcomes 
of an experiment. It consists of "branches" that are labeled with either 
frequencies or probabilities. Tree diagrams can make some probability 


problems easier to visualize and solve. The following example illustrates 
how to use a tree diagram. 


Example: 

In an urn, there are 11 balls. Three balls are red (R) and eight balls are blue 
(B). Draw two balls, one at a time, with replacement. "With replacement" 
means that you put the first ball back in the urn before you select the 
second ball. The tree diagram using frequencies that show all the possible 
outcomes follows. 


                  1st Draw
          8B                  3R
                                        2nd Draw
      8B      3R          8B      3R
     64BB    24BR        24RB     9RR


Total = 64 + 24 + 24 + 9 = 121


The first set of branches represents the first draw. The second set of 
branches represents the second draw. Each of the outcomes is distinct. In 
fact, we can list each red ball as R1, R2, and R3 and each blue ball as B1, 
B2, B3, B4, B5, B6, B7, and B8. Then the nine RR outcomes can be written 
as: 

R1R1 R1R2 R1R3 R2R1 R2R2 R2R3 R3R1 R3R2 R3R3 

The other outcomes are similar. 

There are a total of 11 balls in the urn. Draw two balls, one at a time, with 
replacement. There are 11(11) = 121 outcomes, the size of the sample 
space. 
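
The 121 outcomes can also be generated by a program instead of being listed by hand. The Python sketch below labels the balls R1 to R3 and B1 to B8 as in the text and counts how many ordered pairs fall into each color pattern; the variable names are our own.

```python
from collections import Counter
from itertools import product

balls = ["R1", "R2", "R3"] + [f"B{i}" for i in range(1, 9)]  # 3 red, 8 blue

# With replacement, every ordered pair (first draw, second draw) is possible.
sample_space = list(product(balls, repeat=2))

# Classify each outcome by color only, e.g. ("B2", "R1") becomes "BR".
counts = Counter(first[0] + second[0] for first, second in sample_space)

print(len(sample_space), dict(counts))
```

Each count divided by 121 gives the probability of that color pattern, for example 9/121 for RR.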


Exercise: 


Problem: a. List the 24 BR outcomes: B1R1, B1R2, B1R3, ... 
Solution: 


a. B1R1 B1R2 B1R3 B2R1 B2R2 B2R3 B3R1 B3R2 B3R3 B4R1 B4R2
B4R3 B5R1 B5R2 B5R3 B6R1 B6R2 B6R3 B7R1 B7R2 B7R3 B8R1
B8R2 B8R3


Exercise: 


Problem: b. Using the tree diagram, calculate P(RR). 
Solution: 


b. P(RR) = (3/11)(3/11) = 9/121


Exercise: 


Problem: c. Using the tree diagram, calculate P(RB U BR).


Solution: 

c. P(RB U BR) = (3/11)(8/11) + (8/11)(3/11) = 48/121
Exercise: 

Problem: 


d. Using the tree diagram, calculate 
P(R on 1st draw ∩ B on 2nd draw).


Solution: 


d. P(R on 1st draw ∩ B on 2nd draw) = (3/11)(8/11) = 24/121


Exercise: 


Problem: 


e. Using the tree diagram, calculate P(R on 2nd draw|B on 1st draw). 


Solution: 


e. P(R on 2nd draw|B on 1st draw) = P(R on 2nd|B on 1st) = 24/88 = 3/11
This problem is a conditional one. The sample space has been reduced
to those outcomes that already have a blue on the first draw. There are
24 + 64 = 88 possible outcomes (24 BR and 64 BB). Twenty-four of
the 88 possible outcomes are BR. 24/88 = 3/11.


Exercise: 


Problem: f. Using the tree diagram, calculate P(BB). 


Solution: 

f. P(BB) = 64/121
Exercise: 

Problem: 


g. Using the tree diagram, calculate P(B on the 2nd draw|R on the first 
draw). 


Solution: 


g. P(B on 2nd draw|R on 1st draw) = 8/11


There are 9 + 24 outcomes that have R on the first draw (9 RR and 24
RB). The sample space is then 9 + 24 = 33. 24 of the 33 outcomes
have B on the second draw. The probability is then 24/33 = 8/11.


Note: 
Try It 
Exercise: 


Problem: 


In a standard deck, there are 52 cards. 12 cards are face cards (event 
F) and 40 cards are not face cards (event N). Draw two cards, one at a 
time, with replacement. All possible outcomes are shown in the tree 
diagram as frequencies. Using the tree diagram, calculate P(FF). 


                  1st Draw
         12F                 40N
                                        2nd Draw
     12F     40N         12F     40N
    144FF   480FN       480NF   1,600NN
Solution: 


Total number of outcomes is 144 + 480 + 480 + 1600 = 2,704. 


P(FF) = 144/(144 + 480 + 480 + 1,600) = 144/2,704 = 9/169


Example: 


An urn has three red marbles and eight blue marbles in it. Draw two
marbles, one at a time, this time without replacement, from the urn.
"Without replacement" means that you do not put the first marble back
before you select the second marble. Following is a tree diagram for this
situation. The branches are labeled with probabilities instead of
frequencies. The numbers at the ends of the branches are calculated by
multiplying the numbers on the two corresponding branches, for example,
(3/11)(2/10) = 6/110.


                    1st Draw
           B                      R
          8/11                   3/11
                                            2nd Draw
      B         R            B         R
     7/10      3/10         8/10      2/10

     BB        BR           RB        RR
    56/110    24/110       24/110    6/110

Total = (56 + 24 + 24 + 6)/110 = 110/110 = 1
Note: 
NOTE 


If you draw a red on the first draw from the three red possibilities, there 
are two red marbles left to draw on the second draw. You do not put back 
or replace the first marble after you have drawn it. You draw without 
replacement, so that on the second draw there are ten marbles left in the 
urn.


Calculate the following probabilities using the tree diagram. 


Exercise: 


Problem: a. P(RR) = 


Solution: 
a. P(RR) = (3/11)(2/10) = 6/110
Exercise: 


Problem: b. Fill in the blanks: 

P(RB U BR) = (3/11)(8/10) + (___)(___) = 48/110

Solution: 

b. P(RB U BR) = (3/11)(8/10) + (8/11)(3/10) = 48/110
Exercise: 

Problem: c. P(R on 2nd|B on 1st) = 


Solution: 


c. P(R on 2nd|B on 1st) = 3/10
Exercise: 
Problem: d. Fill in the blanks. 


P(R on 1st ∩ B on 2nd) = (___)(___) = 24/110


Solution: 


d. P(R on 1st ∩ B on 2nd) = (3/11)(8/10) = 24/110


Exercise: 
Problem: e. Find P(BB). 
Solution: 
e. P(BB) = (8/11)(7/10) = 56/110
Exercise: 


Problem: f. Find P(B on 2nd|R on 1st). 
Solution: 


f. Using the tree diagram, P(B on 2nd|R on 1st) = P(B|R) = 8/10


If we are using probabilities, we can label the tree in the following general 
way. 


P(B) P(R) 


P(B| B) P(R| B) P(B| R) P(R| R) 


P(B AND B)=P(BB) P(BAND R)=P(BR) P(R AND B)=P(RB) P(R AND R)=P(RR) 


P(R|R) here means P(R on 2nd|R on 1st)
P(B|R) here means P(B on 2nd|R on 1st) 
P(R|B) here means P(R on 2nd|B on 1st) 
P(B|B) here means P(B on 2nd|B on 1st) 
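
The general labeling above translates directly into a short calculation: each end-of-branch probability is the first-draw probability times the conditional second-draw probability. The Python sketch below builds the tree for the urn with three red and eight blue marbles, drawn without replacement; the function `two_draw_tree` is a hypothetical helper written for this illustration.

```python
from fractions import Fraction


def two_draw_tree(red, blue):
    """Return P(first and second) for two draws without replacement."""
    total = red + blue
    start = {"R": red, "B": blue}
    tree = {}
    for first in "RB":
        p_first = Fraction(start[first], total)
        remaining = dict(start)
        remaining[first] -= 1  # the drawn marble is not put back
        for second in "RB":
            # P(second | first) uses the marbles left after the first draw.
            p_second_given_first = Fraction(remaining[second], total - 1)
            tree[first + second] = p_first * p_second_given_first
    return tree


tree = two_draw_tree(red=3, blue=8)
for outcome, p in tree.items():
    print(outcome, p)
```

The four end-of-branch probabilities always sum to 1, which is a quick check that the tree has been labeled correctly.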


Note: 
Try It 
Exercise: 


Problem: 


In a standard deck, there are 52 cards. Twelve cards are face cards (F) 
and 40 cards are not face cards (N). Draw two cards, one at a time, 
without replacement. The tree diagram is labeled with all possible 
probabilities. 


                    1st Draw
           F                      N
         12/52                  40/52
                                            2nd Draw
      F         N            F         N
    11/51     40/51        12/51     39/51

     FF        FN           NF        NN
  132/2,652  480/2,652   480/2,652  1,560/2,652


a. Find P(FN U NF).
b. Find P(N|F).
c. Find P(at most one face card).
Hint: "At most one face card" means zero or one face card.
d. Find P(at least one face card).
Hint: "At least one face card" means one or two face cards.


Solution: 


a. P(FN U NF) = 480/2,652 + 480/2,652 = 960/2,652 = 80/221
b. P(N|F) = 40/51
c. P(at most one face card) = (480 + 480 + 1,560)/2,652 = 2,520/2,652
d. P(at least one face card) = (132 + 480 + 480)/2,652 = 1,092/2,652
Example: 


A litter of kittens available for adoption at the Humane Society has four 
tabby kittens and five black kittens. A family comes in and randomly 
selects two kittens (without replacement) for adoption. 


                    1st Kitten
           T                      B
          4/9                    5/9
                                            2nd Kitten
      T         B            T         B
     3/8       5/8          4/8       4/8

     TT        TB           BT        BB
Exercise: 
Problem: 


a. What is the probability that both kittens are tabby?

a. (1/2)(1/2)  b. (4/9)(4/9)  c. (4/9)(3/8)  d. (4/9)(5/9)

b. What is the probability that one kitten of each coloring is
selected?

a. (4/9)(5/9)  b. (4/9)(5/8)  c. (4/9)(5/9) + (5/9)(4/9)
d. (4/9)(5/8) + (5/9)(4/8)


c. What is the probability that a tabby is chosen as the second kitten 
when a black kitten was chosen as the first? 

d. What is the probability of choosing two kittens of the same 
color? 


Solution: 


a. c: (4/9)(3/8); b. d: (4/9)(5/8) + (5/9)(4/8); c. 4/8; d. 32/72


Note: 
Try It 
Exercise: 


Problem: 
Suppose there are four red balls and three yellow balls in a box. Two 
balls are drawn from the box without replacement. What is the 


probability that one ball of each coloring is selected? 


Solution: 
Chapter Review


There are several tools you can use to help organize and sort data when
calculating probabilities. Contingency tables help display data and are
particularly useful when calculating probabilities that have multiple
dependent variables.


A tree diagram uses branches to show the different outcomes of experiments
and makes complex probability questions easy to visualize.


Glossary 


Tree Diagram 
the useful visual representation of a sample space and events in the 
form of a “tree” with branches marked by possible outcomes together 
with associated probabilities (frequencies, relative frequencies) 


Contingency Table 
the method of displaying a frequency distribution as a table with rows 
and columns to show how two variables may be dependent 
(contingent) upon each other; the table provides an easy way to 
calculate conditional probabilities. 


Introduction


You can use probability and discrete random variables to calculate the
likelihood of lightning striking the ground five times during a half-hour
thunderstorm. (Credit: Leszek Leszczynski)


A student takes a ten-question, true-false quiz. Because the student had such 
a busy schedule, he or she could not study and guesses randomly at each 
answer. What is the probability of the student passing the test with at least a 
70%? 


Small companies might be interested in the number of long-distance phone 
calls their employees make during the peak time of the day. Suppose the 
historical average is 20 calls. What is the probability that the employees 
make more than 20 long-distance phone calls during the peak time? 


These two examples illustrate two different types of probability problems 
involving discrete random variables. Recall that discrete data are data that 
you can count, that is, the random variable can only take on whole number 
values. A random variable describes the outcomes of a statistical 
experiment in words. The values of a random variable can vary with each 
repetition of an experiment, often called a trial. 


Random Variable Notation 


The upper case letter X denotes a random variable. Lower case letters like x 
or y denote the value of a random variable. If X is a random variable, then 
X is written in words, and x is given as a number. 


For example, let X = the number of heads you get when you toss three fair 
coins. The sample space for the toss of three fair coins is TTT; THH; HTH; 
HHT; HTT; THT; TTH; HHH. Then, x = 0, 1, 2, 3. X is in words and x is a 
number. Notice that for this example, the x values are countable outcomes. 
Because you can count the possible values as whole numbers that X can 
take on and the outcomes are random (the x values 0, 1, 2, 3), X is a discrete 
random variable. 


Probability Density Functions (PDF) for a Random Variable 


A probability density function or probability distribution function has 
two characteristics: 


1. Each probability is between zero and one, inclusive. 


2. The sum of the probabilities is one. 
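
These two characteristics can be checked directly for the three-coin random variable X defined above. The Python sketch below lists the eight equally likely outcomes, counts heads in each, and builds the probability of every value x; the variable names are our own.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# X = number of heads when three fair coins are tossed.
outcomes = list(product("HT", repeat=3))          # 8 equally likely outcomes
heads = Counter(o.count("H") for o in outcomes)   # how many outcomes give each x
pmf = {x: Fraction(n, len(outcomes)) for x, n in sorted(heads.items())}

# Characteristic 1: each probability is between zero and one, inclusive.
each_in_range = all(0 <= p <= 1 for p in pmf.values())
# Characteristic 2: the probabilities sum to one.
sums_to_one = (sum(pmf.values()) == 1)

print(pmf, each_in_range, sums_to_one)
```

The resulting probabilities are 1/8, 3/8, 3/8, and 1/8 for x = 0, 1, 2, and 3.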


A probability density function is a mathematical formula that calculates 
probabilities for specific types of events, what we have been calling 
experiments. There is a sort of magic to a probability density function (Pdf) 
partially because the same formula often describes very different types of 
events. For example, the binomial Pdf will calculate probabilities for 
flipping coins, yes/no questions on an exam, opinions of voters in an up or 
down opinion poll, indeed any binary event. Other probability density 
functions will provide probabilities for the time until a part will fail, when a 
customer will arrive at the turnpike booth, the number of telephone calls 
arriving at a central switchboard, the growth rate of a bacterium, and on and 
on. There are whole families of probability density functions that are used 
in a wide variety of applications, including medicine, business and finance, 
physics and engineering, among others. 


For our needs here we will concentrate on only a few probability density 
functions as we develop the tools of inferential statistics. 


Counting Formulas and the Combinational Formula 


To repeat, the probability of event A, P(A), is simply the number of ways
the experiment will result in A, relative to the total number of possible 
outcomes of the experiment. 


As an equation this is: 
Equation: 


P(A) = number of ways to get A 
~ Total number of possible outcomes 


When we looked at the sample space for flipping 3 coins we could easily 
write the full sample space and thus could easily count the number of events 
that met our desired result, e.g. x = 1, where X is the random variable 
defined as the number of heads. 


As we have larger numbers of items in the sample space, such as a full deck 
of 52 cards, writing out the sample space becomes impractical. 


We see that probabilities are nothing more than counting the events in each 
group we are interested in and dividing by the number of elements in the 
universe, or sample space. This is easy enough if we are counting 
sophomores in a Stat class, but in more complicated cases listing all the 
possible outcomes may take a lifetime. There are, for example, 36 possible 
outcomes from throwing just two six-sided dice where the random variable 
is the sum of the number of spots on the up-facing sides. If there were four 
dice then the total number of possible outcomes would become 1,296. 
There are more than 2.5 MILLION possible 5 card poker hands in a 
standard deck of 52 cards. Obviously keeping track of all these possibilities 
and counting them to get at a single probability would be tedious at best. 
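For small experiments the sample space can still be listed by brute force, which makes the counts quoted above easy to check. The following is a minimal Python sketch using only the standard library; the variable names are illustrative and not part of the text.

```python
from itertools import product

# Sample space for three fair coins: every ordered sequence of H and T.
coin_space = ["".join(flips) for flips in product("HT", repeat=3)]

# Outcomes with exactly one head, i.e. the event x = 1.
one_head = [outcome for outcome in coin_space if outcome.count("H") == 1]
p_one_head = len(one_head) / len(coin_space)

# Sizes of the dice sample spaces mentioned in the text.
two_dice = len(list(product(range(1, 7), repeat=2)))
four_dice = len(list(product(range(1, 7), repeat=4)))

print(coin_space)            # the 8 outcomes, HHH through TTT
print(p_one_head)            # 3 of the 8 outcomes have exactly one head
print(two_dice, four_dice)   # 36 and 1296
```

Listing works here because the spaces are tiny; for something like five-card poker hands the same approach would have to build millions of tuples, which is the motivation for counting formulas.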


An alternative to listing the complete sample space and counting the 
elements we are interested in is to skip the listing step, simply work out 
the number of elements in the sample space, and do the appropriate 
division. If we are after a probability, we really do not need to see each 
and every element in the sample space; we only need to know how many 
elements there are. Counting formulas were invented to do just this. They 
tell us the number of unordered subsets of a certain size that can be 
created from a set of unique elements. By unordered it is meant that, for 
example, when dealing cards, it does not matter if you got {ace, ace, ace, 
ace, king} or {king, ace, ace, ace, ace} or {ace, king, ace, ace, ace} and 
so on. Each of these subsets is the same because each has 4 aces and 
one king. 


Combinatorial Formula 


Equation: 

\[ \binom{n}{x} = \frac{n!}{x!\,(n-x)!} \]

This is the formula that tells the number of unique unordered subsets of size 
x that can be created from n unique elements. The formula is read “n 
combinatorial x”. Sometimes it is read as “n choose x.” The exclamation 
point "!" is called a factorial and tells us to take all the numbers from 1 
through the number before the ! and multiply them together; thus 4! is 
1·2·3·4 = 24. By definition 0! = 1. The formula is called the Combinatorial 
Formula. It is also called the Binomial Coefficient, for reasons that will be 
clear shortly. While this mathematical concept was understood long before 
1653, Blaise Pascal is given major credit for his proof that he published in 
that year. Further, he developed a generalized method of calculating the 
values for combinatorials known to us as Pascal's Triangle. Pascal was 
one of the geniuses of an era of extraordinary intellectual advancement 
which included the work of Galileo, Rene Descartes, Isaac Newton, 
William Shakespeare and the refinement of the scientific method, the very 
rationale for the topic of this text. 


Let’s find the hard way the total number of combinations of the four aces in 
a deck of cards if we were going to take them two at a time. The sample 
space would be: 


S = {(Spade, Heart), (Spade, Diamond), (Spade, Club), (Diamond, Club), 
(Heart, Diamond), (Heart, Club)} 


There are 6 combinations; formally, six unique unordered subsets of size 2 
that can be created from 4 unique elements. To use the combinatorial 
formula we would solve the formula as follows: 


Equation: 

\[ \binom{4}{2} = \frac{4!}{2!\,(4-2)!} = \frac{4 \cdot 3 \cdot 2 \cdot 1}{(2 \cdot 1)(2 \cdot 1)} = 6 \]


If we wanted to know the number of unique 5 card poker hands that could 
be created from a 52 card deck we simply compute: 


Equation: 

\[ \binom{52}{5} = \frac{52!}{5!\,(52-5)!} = 2{,}598{,}960 \]


where 52 is the total number of unique elements from which we are 
drawing and 5 is the size group we are putting them into. 
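Both counts above can be checked in a few lines. This is a minimal Python sketch, assuming Python 3.8 or later for `math.comb`; the helper name `n_choose_x` is hypothetical and simply spells out the factorial form of the combinatorial formula.

```python
from itertools import combinations
from math import comb, factorial

def n_choose_x(n, x):
    """Combinatorial formula n! / (x! (n - x)!) written out with factorials."""
    return factorial(n) // (factorial(x) * factorial(n - x))

# The "hard way": list every unordered pair of the four aces.
aces = ["Spade", "Heart", "Diamond", "Club"]
pairs = list(combinations(aces, 2))

print(len(pairs))          # 6 pairs when listed one by one
print(n_choose_x(4, 2))    # 6 from the formula
print(comb(52, 5))         # 2598960 five-card hands
```

The listing and the formula agree for the aces, and the built-in `comb` gives the poker-hand count without constructing a single hand.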


With the combinatorial formula we can count the number of elements in a 
sample space without having to write each one of them down, truly a 
lifetime's work for just the number of 5 card hands from a deck of 52 cards. 
We can now apply this tool to a very important probability density function, 
the hypergeometric distribution. 


Remember, a probability density function computes probabilities for us. We 
simply put the appropriate numbers in the formula and we get the 
probability of specific events. However, for these formulas to work they 
must be applied only to cases for which they were designed. 


Chapter Review 


The characteristics of a probability distribution or density function (PDF) 
are as follows: 


1. Each probability is between zero and one, inclusive (inclusive means 
to include zero and one). 
2. The sum of the probabilities is one. 


Use the following information to answer the next five exercises: A company 
wants to evaluate its attrition rate, in other words, how long new hires stay 
with the company. Over the years, they have established the following 
probability distribution. 


Let X = the number of years a new hire will stay with the company. 
Let P(x) = the probability that a new hire will stay with the company x 
years. 
x P(x) 
0 0.12 
1 0.18 
2 0.30 
3 0.15 
4 
5 0.10 
6 0.05 

Exercise: 


Problem: Complete the table using the data provided. 


Solution: 


x P(x) 
0 0.12 
1 0.18 
2 0.30 
3 0.15 
4 0.10 
5 0.10 
6 0.05 
Exercise: 


Problem: P(x = 4) = 
Exercise: 

Problem: P(x ≥ 5) = 

Solution: 


0.10 + 0.05 = 0.15 
Exercise: 


Problem: 


On average, how long would you expect a new hire to stay with the 
company? 


Exercise: 
Problem: What does the column “P(x)” sum to? 


Solution: 


1 


Use the following information to answer the next six exercises: A baker is 
deciding how many batches of muffins to make to sell in his bakery. He 
wants to make enough to sell every one and no fewer. Through observation, 
the baker has established a probability distribution. 


x P(x) 

1 0.15 

2 0.35 

3 0.40 

4 0.10 
Exercise: 


Problem: Define the random variable X. 
Exercise: 


Problem: 


What is the probability the baker will sell more than one batch? 
P(x > 1) = 


Solution: 


0.35 + 0.40 + 0.10 = 0.85 
Exercise: 


Problem: 


What is the probability the baker will sell exactly one batch? P(x = 1) 


Exercise: 


Problem: On average, how many batches should the baker make? 


Solution: 


1(0.15) + 2(0.35) + 3(0.40) + 4(0.10) = 0.15 + 0.70 + 1.20 + 0.40 = 
2.45 
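The expected-value calculation above is a sum of x times P(x), and the same table can also be checked against the two characteristics of a probability distribution. The following is a minimal Python sketch; the function names are hypothetical and chosen only for this illustration.

```python
# Baker's distribution from the table: batches sold -> probability.
baker = {1: 0.15, 2: 0.35, 3: 0.40, 4: 0.10}

def is_valid_distribution(dist, tol=1e-9):
    """Check both characteristics: each P(x) is in [0, 1] and they sum to 1."""
    in_range = all(0 <= p <= 1 for p in dist.values())
    return in_range and abs(sum(dist.values()) - 1) < tol

def expected_value(dist):
    """Long-run average: the sum of x * P(x) over every value x."""
    return sum(x * p for x, p in dist.items())

print(is_valid_distribution(baker))
print(round(expected_value(baker), 2))   # 2.45, matching the solution above
```

A small tolerance is used when checking the sum because decimal probabilities are stored as floating-point numbers.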


Use the following information to answer the next four exercises: Ellen has 
music practice three days a week. She practices for all of the three days 
85% of the time, two days 8% of the time, one day 4% of the time, and no 
days 3% of the time. One week is selected at random. 

Exercise: 


Problem: Define the random variable X. 


Exercise: 


Problem: Construct a probability distribution table for the data. 


Solution: 
x P(x) 
0 0.03 
1 0.04 
2 0.08 
3 0.85 


Exercise: 
Problem: 
We know that for a probability distribution function to be discrete, it 
must have two characteristics. One is that the sum of the probabilities 
is one. What is the other characteristic? 


Use the following information to answer the next five exercises: Javier 
volunteers in community events each month. He does not do more than five 
events in a month. He attends exactly five events 35% of the time, four 
events 25% of the time, three events 20% of the time, two events 10% of 
the time, one event 5% of the time, and no events 5% of the time. 
Exercise: 


Problem: Define the random variable X. 


Solution: 


Let X = the number of events Javier volunteers for each month. 


Exercise: 


Problem: What values does x take on? 


Exercise: 


Problem: Construct a PDF table. 


Solution: 


x P(x) 


0 0.05 
1 0.05 
2 0.10 
3 0.20 
4 0.25 
5 0.35 
Exercise: 
Problem: 


Find the probability that Javier volunteers for less than three events 
each month. P(x < 3) = 


Exercise: 


Problem: 


Find the probability that Javier volunteers for at least one event each 
month. P(x > 0) = 


Solution: 


1 - 0.05 = 0.95 


Glossary 


Random Variable (RV) 
a characteristic of interest in a population being studied; the common 
notation for variables is upper case Latin letters X, Y, Z,...; the common 
notation for a specific value from the domain (set of all possible values 
of a variable) is lower case Latin letters x, y, and z. For example, if X 
is the number of children in a family, then x represents a specific 
integer 0, 1, 2, 3,.... Variables in statistics differ from variables in 
intermediate algebra in the two following ways. 


• The domain of the random variable (RV) is not necessarily a 
numerical set; the domain may be expressed in words; for 
example, if X = hair color then the domain is {black, blond, gray, 
green, orange}. 

• We can tell what specific value x the random variable X takes 
only after performing the experiment. 


Probability Distribution Function (PDF) 
a mathematical description of a discrete random variable (RV), given 
either in the form of an equation (formula) or in the form of a table 
listing all the possible outcomes of an experiment and the probability 
associated with each outcome. 


Binomial Distribution 


A more valuable probability density function with many applications is the 
binomial distribution. This distribution will compute probabilities for any 
binomial process. A binomial process, often called a Bernoulli process after 
the first person to fully develop its properties, is any case where there are 
only two possible outcomes in any one trial, called successes and failures. It 
gets its name from the binary number system where all numbers are 
reduced to either 1's or 0's, which is the basis for computer technology and 
CD music recordings. 


Binomial Formula 


Equation: 

\[ b(x) = \binom{n}{x} p^{x} q^{n-x} \]

where b(x) is the probability of X successes in n trials when the probability 
of a success in ANY ONE TRIAL is p. And of course q=(1-p) and is the 
probability of a failure in any one trial. 
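The binomial formula, the binomial coefficient multiplied by p raised to the x and q raised to the n - x, translates almost directly into code. The following is a minimal Python sketch; the function name `binomial_pmf` is hypothetical, and the three-coin case from earlier in the chapter is used as the check.

```python
from math import comb

def binomial_pmf(x, n, p):
    """b(x): probability of exactly x successes in n independent trials."""
    q = 1 - p   # probability of a failure in any one trial
    return comb(n, x) * p**x * q**(n - x)

# Three fair coins, X = number of heads.
for x in range(4):
    print(x, binomial_pmf(x, 3, 0.5))   # 0.125, 0.375, 0.375, 0.125
```

These four probabilities match what counting the eight outcomes TTT through HHH gives directly, and they sum to one as any probability distribution must.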


We can see now why the combinatorial formula is also called the binomial 
coefficient because it reappears here again in the binomial probability 
function. For the binomial formula to work, the probability of a success in 
any one trial must be the same from trial to trial, or in other words, the 
outcomes of each trial must be independent. Flipping a coin is a binomial 
process because the probability of getting a head in one flip does not 
depend upon what has happened in PREVIOUS flips. (At this time it should 
be noted that using p for the parameter of the binomial distribution is a 
violation of the rule that population parameters are designated with Greek 
letters. In many textbooks θ (pronounced theta) is used instead of p, and 
this is how it should be.) 


Just like a set of data, a probability density function has a mean and a 
standard deviation that describe the distribution. For the binomial 
distribution these are given by the formulas: 
Equation: 


\[ \mu = np \]


Equation: 


\[ \sigma = \sqrt{npq} \]


Notice that, for a given number of trials n, p is the only parameter in these 
equations. The binomial distribution is thus seen as coming from the 
one-parameter family of probability distributions. In short, for a given n, 
we know all there is to know about the binomial once we know p, the 
probability of a success in any one trial. 


In probability theory, under certain circumstances, one probability 
distribution can be used to approximate another. We say that one is the 
limiting distribution of the other. If a small number is to be drawn from a 
large population, even if there is no replacement, we can still use the 
binomial even though this is not a binomial process. Drawing without 
replacement violates the independence rule of the binomial. Nevertheless, 
we can use the binomial to approximate a probability that is really 
hypergeometric if we are drawing fewer than 10 percent of the 
population, i.e. n is less than 10 percent of N in the formula for the 
hypergeometric function. The rationale for this argument is that when 
drawing a small percentage of the population we do not alter the probability 
of a success from draw to draw in any meaningful way. Imagine drawing 
from not one deck of 52 cards but from 6 decks of cards. Drawing an ace on 
the first draw changes the conditional probability of drawing an ace on the 
second draw far less when there are 24 aces among 312 cards than when 
there are only 4 aces among 52. This ability to use one probability 
distribution to estimate others will become very valuable to us later. 
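The deck illustration can be made concrete by computing the same probability both ways. The sketch below is a hedged Python example: it assumes the standard hypergeometric counting form (ways to pick the successes, times ways to pick the failures, divided by all possible draws), and the symbol K for the number of successes in the population, the function names, and the choice of a five-card draw with exactly one ace are assumptions made only for this illustration.

```python
from math import comb

def hypergeometric_pmf(x, N, K, n):
    """Exactly x successes when n items are drawn WITHOUT replacement
    from a population of N items that contains K successes."""
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)

def binomial_pmf(x, n, p):
    """Exactly x successes in n independent trials with success chance p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def approximation_gap(decks, hand=5, aces_wanted=1):
    """Absolute difference between the exact and the binomial answer."""
    N, K = 52 * decks, 4 * decks
    exact = hypergeometric_pmf(aces_wanted, N, K, hand)
    approx = binomial_pmf(aces_wanted, hand, K / N)
    return abs(exact - approx)

print(approximation_gap(1))   # one deck: 5 cards is close to 10% of 52
print(approximation_gap(6))   # six decks: 5 cards is under 2% of 312
```

With six decks the draw is a much smaller share of the population, so the binomial answer sits noticeably closer to the exact one than it does with a single deck.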


There are four characteristics of a binomial experiment. 


1. There are a fixed number of trials. Think of trials as repetitions of an 
experiment. The letter n denotes the number of trials. 


2. The random variable, x, number of successes, is discrete. 

3. There are only two possible outcomes, called "success" and "failure," 
for each trial. The letter p denotes the probability of a success on any 
one trial, and q denotes the probability of a failure on any one trial. p + 
q = 1. 

4. The n trials are independent and are repeated using identical 
conditions. Think of this as drawing WITH replacement. Because the n 
trials are independent, the outcome of one trial does not help in 
predicting the outcome of another trial. Another way of saying this is 
that for each individual trial, the probability, p, of a success and 
probability, q, of a failure remain the same. For example, randomly 
guessing at a true-false statistics question has only two outcomes. If a 
success is guessing correctly, then a failure is guessing incorrectly. 
Suppose Joe guesses correctly on any statistics true-false 
question with a probability p = 0.6. Then, q = 0.4. This means that for 
every true-false statistics question Joe answers, his probability of 
success (p = 0.6) and his probability of failure (q = 0.4) remain the 
same. 


The outcomes of a binomial experiment fit a binomial probability 
distribution. The random variable X = the number of successes obtained in 
the n independent trials. 


The mean, μ, and variance, σ², for the binomial probability distribution are 
μ = np and σ² = npq. The standard deviation, σ, is then σ = √(npq). 


Any experiment that has characteristics three and four and where n = 1 is 
called a Bernoulli Trial (named after Jacob Bernoulli who, in the late 
1600s, studied them extensively). A binomial experiment takes place when 
the number of successes is counted in one or more Bernoulli Trials. 


Example: 

Suppose you play a game that you can only either win or lose. The 
probability that you win any game is 55%, and the probability that you lose 
is 45%. Each game you play is independent. If you play the game 20 times, 
write the function that describes the probability that you win 15 of the 20 
times. Here, if you define X as the number of wins, then X takes on the 
values 0, 1, 2, 3, ..., 20. The probability of a success is p = 0.55. The 
probability of a failure is q = 0.45. The number of trials is n = 20. The 
probability question can be stated mathematically as P(x = 15). 


Note: 

Try It 

Exercise: 
Problem: 
A trainer is teaching a dolphin to do tricks. The probability that the 
dolphin successfully performs the trick is 35%, and the probability 
that the dolphin does not successfully perform the trick is 65%. Out of 
20 attempts, you want to find the probability that the dolphin succeeds 
12 times. Find P(X = 12) using the binomial Pdf. 


Solution: 


P(x = 12) 


Example: 
Exercise: 


Problem: 

A fair coin is flipped 15 times. Each flip is independent. What is the 
probability of getting more than ten heads? Let X = the number of 
heads in 15 flips of the fair coin. X takes on the values 0, 1, 2, 3, ..., 
15. Since the coin is fair, p = 0.5 and q = 0.5. The number of trials is n 
= 15. State the probability question mathematically. 


Solution: 


P(x > 10) 
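A "more than" question like this one is answered by adding the binomial probabilities for every value in the tail, or equivalently by subtracting the rest from one. The following is a minimal Python sketch for the 15 coin flips; the function and variable names are hypothetical.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 15, 0.5

# Direct tail sum: x = 11, 12, 13, 14, 15.
p_more_than_10 = sum(binomial_pmf(x, n, p) for x in range(11, n + 1))

# Same answer through the complement, 1 - P(x <= 10).
p_via_complement = 1 - sum(binomial_pmf(x, n, p) for x in range(0, 11))

print(round(p_more_than_10, 4))     # 0.0592
print(round(p_via_complement, 4))   # 0.0592
```

The complement form is the one that matches calculator functions that return cumulative "at most" probabilities.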


Example: 

Approximately 70% of statistics students do their homework in time for it 
to be collected and graded. Each student does homework independently. In 
a Statistics class of 50 students, what is the probability that at least 40 will 
do their homework on time? Students are selected randomly. 


Exercise: 


Problem: 
a. This is a binomial problem because there is only a success or a 
__________, there are a fixed number of trials, and the probability of 
a success is 0.70 for each trial. 
Solution: 
a. failure 
Exercise: 


Problem: 


b. If we are interested in the number of students who do their 
homework on time, then how do we define X? 


Solution: 


b. X = the number of statistics students who do their homework on 
time 


Exercise: 


Problem: c. What values does x take on? 


Solution: 


c. 0, 1, 2, ..., 50 
Exercise: 


Problem: d. What is a "failure," in words? 


Solution: 


d. Failure is defined as a student who does not complete his or her 
homework on time. 


The probability of a success is p = 0.70. The number of trials is n = 
50. 


Exercise: 


Problem: e. If p + q = 1, then what is q? 
Solution: 
e. q = 0.30 
Exercise: 
Problem: 


f. The words "at least" translate as what kind of inequality for the 
probability question P(x ____ 40)? 


Solution: 


f. greater than or equal to (≥) 
The probability question is P(x ≥ 40). 


Note: 
Try It 
Exercise: 


Problem: 


Sixty-five percent of people pass the state driver’s exam on the first 
try. A group of 50 individuals who have taken the driver’s exam is 
randomly selected. Give two reasons why this is a binomial problem. 


Solution: 


This is a binomial problem because there is only a success or a failure, 
and there are a definite number of trials. The probability of a success 
stays the same for each trial. 


Note: 
Try It 
Exercise: 


Problem: 


During the 2013 regular NBA season, DeAndre Jordan of the Los 
Angeles Clippers had the highest field goal completion rate in the 
league. DeAndre scored with 61.3% of his shots. Suppose you choose 
a random sample of 80 shots made by DeAndre during the 2013 
season. Let X = the number of shots that scored points. 


a. What is the probability distribution for X? 

b. Using the formulas, calculate the (i) mean and (ii) standard 
deviation of X. 

c. Find the probability that DeAndre scored with 60 of these shots. 

d. Find the probability that DeAndre scored with more than 50 of 
these shots. 


Solution: 
a. X ~ B(80, 0.613) 


b. i. Mean = np = 80(0.613) = 49.04 
ii. Standard Deviation = √(npq) = √(80(0.613)(0.387)) ≈ 4.3564 


c. P(x = 60) = 0.0036 
d. P(x > 50) = 1 - P(x ≤ 50) = 1 - 0.6282 = 0.3718 
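The four parts of this solution can be reproduced without a statistics calculator. The sketch below is a minimal Python example under the problem's own assumption that X ~ B(80, 0.613); the variable names are hypothetical, and the printed values are meant to be compared with the solution rather than taken as a replacement for it.

```python
from math import comb, sqrt

n, p = 80, 0.613
q = 1 - p

def binomial_pmf(x):
    """P(x) for X ~ B(80, 0.613)."""
    return comb(n, x) * p**x * q**(n - x)

mean = n * p                    # np
std_dev = sqrt(n * p * q)       # square root of npq
p_exactly_60 = binomial_pmf(60)
p_more_than_50 = 1 - sum(binomial_pmf(x) for x in range(0, 51))

print(round(mean, 2), round(std_dev, 4))
print(round(p_exactly_60, 4), round(p_more_than_50, 4))
```

Note that "more than 50" excludes 50 itself, which is why the complement subtracts the probabilities for 0 through 50 inclusive.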
Chapter Review 


A statistical experiment can be classified as a binomial experiment if the 
following conditions are met: 


1. There are a fixed number of trials, n. 

2. There are only two possible outcomes, called "success" and "failure," 
for each trial. The letter p denotes the probability of a success on one 
trial and q denotes the probability of a failure on one trial. 

3. The n trials are independent and are repeated using identical 
conditions. 


The outcomes of a binomial experiment fit a binomial probability 
distribution. The random variable X = the number of successes obtained in 
the n independent trials. The mean of X can be calculated using the formula 
μ = np, and the standard deviation is given by the formula σ = √(npq). 


The formula for the Binomial probability density function is 
Equation: 

\[ P(x) = \binom{n}{x} p^{x} q^{n-x} \]

Formula Review 


X ~ B(n, p) means that the discrete random variable X has a binomial 
probability distribution with n trials and probability of success p. 


X = the number of successes in n independent trials 
n= the number of independent trials 

X takes on the values x = 0, 1, 2, 3, ..., n 

p = the probability of a success for any trial 

q = the probability of a failure for any trial 
p + q = 1 

q = 1 - p 


The mean of X is μ = np. The standard deviation of X is σ = √(npq). 
Equation: 

\[ P(x) = \binom{n}{x} p^{x} q^{n-x} \]

where P(x) is the probability of x successes in n trials when the probability 
of a success in ANY ONE TRIAL is p. 


Use the following information to answer the next eight exercises: The 
Higher Education Research Institute at UCLA collected data from 203,967 
incoming first-time, full-time freshmen from 270 four-year colleges and 
universities in the U.S. 71.3% of those students replied that, yes, they 
believe that same-sex couples should have the right to legal marital status. 
Suppose that you randomly pick eight first-time, full-time freshmen from 
the survey. You are interested in the number who believe that same-sex 
couples should have the right to legal marital status. 

Exercise: 


Problem: In words, define the random variable X. 


Solution: 
X = the number that reply “yes” 


Exercise: 


Problem: X ~ _____(_____,_____) 


Exercise: 
Problem: What values does the random variable X take on? 


Solution: 


0, 1, 2, 3, 4, 5, 6, 7, 8 


Exercise: 


Problem: Construct the probability distribution function (PDF). 


x P(x) 


Exercise: 


Problem: On average (μ), how many would you expect to answer yes? 
Solution: 


5.7 


Exercise: 


Problem: What is the standard deviation (σ)? 
Exercise: 


Problem: 
What is the probability that at most five of the freshmen reply “yes”? 
Solution: 


0.4151 
Exercise: 


Problem: 


What is the probability that at least two of the freshmen reply “yes”? 


HOMEWORK 


Exercise: 


Problem: 


According to a recent article the average number of babies born with 
significant hearing loss (deafness) is approximately two per 1,000 
babies in a healthy baby nursery. The number climbs to an average of 
30 per 1,000 babies in an intensive care nursery. 


Suppose that 1,000 babies from healthy baby nurseries were randomly 
surveyed. Find the probability that exactly two babies were born deaf. 


Use the following information to answer the next four exercises. Recently, a 
nurse commented that when a patient calls the medical advice line claiming 
to have the flu, the chance that he or she truly has the flu (and not just a 
nasty cold) is only about 4%. Of the next 25 patients calling in claiming to 
have the flu, we are interested in how many actually have the flu. 

Exercise: 


Problem: Define the random variable and list its possible values. 
Solution: 


X = the number of patients calling in claiming to have the flu, who 
actually have the flu. 


x = 0, 1, 2, ..., 25 


Exercise: 


Problem: State the distribution of X. 
Exercise: 


Problem: 


Find the probability that at least four of the 25 patients actually have 
the flu. 


Solution: 


0.0165 
Exercise: 
Problem: 
On average, for every 25 patients calling in, how many do you expect 
to have the flu? 


Exercise: 


Problem: 


People visiting video rental stores often rent more than one DVD at a 
time. The probability distribution for DVD rentals per customer at 
Video To Go is given in the table below. There is a five-video limit per 
customer at this store, so nobody ever rents more than five DVDs. 


x P(x) 
0 0.03 
1 0.50 
2 0.24 
3 
4 0.07 
5 0.04 


a. Describe the random variable X in words. 

b. Find the probability that a customer rents three DVDs. 

c. Find the probability that a customer rents at least four DVDs. 
d. Find the probability that a customer rents at most two DVDs. 


Solution: 


a. X = the number of DVDs a Video to Go customer rents 
b. 0.12 
c. 0.11 
d. 0.77 


Exercise: 


Problem: 


A school newspaper reporter decides to randomly survey 12 students 
to see if they will attend Tet (Vietnamese New Year) festivities this 
year. Based on past years, she knows that 18% of students attend Tet 
festivities. We are interested in the number of students who will attend 
the festivities. 


a. In words, define the random variable X. 
b. List the values that X may take on. 


c. Give the distribution of X. X ~ _____(_____,_____) 
d. How many of the 12 students do we expect to attend the 
festivities? 


e. Find the probability that at most four students will attend. 
f. Find the probability that more than two students will attend. 


Use the following information to answer the next two exercises: The 
probability that the San Jose Sharks will win any given game is 0.3694 
based on a 13-year win history of 382 wins out of 1,034 games played (as 
of a certain date). An upcoming monthly schedule contains 12 games. 
Exercise: 


Problem: The expected number of wins for that upcoming month is: 


a. 1.67 
b. 12 
c. 382/1043 
d. 4.43 


Solution: 


d. 4.43 


Let X = the number of games won in that upcoming month. 
Exercise: 


Problem: 


What is the probability that the San Jose Sharks win six games in that 
upcoming month? 


a. 0.1476 
b. 0.2336 
c. 0.7664 
d. 0.8903 


Exercise: 
Problem: 


What is the probability that the San Jose Sharks win at least five games 
in that upcoming month? 


a. 0.3694 
b. 0.5266 
c. 0.4734 
d. 0.2305 


Solution: 


C 
Exercise: 

Problem: 

A student takes a ten-question true-false quiz, but did not study and 
randomly guesses each answer. Find the probability that the student 
passes the quiz with a grade of at least 70% of the questions correct. 


Exercise: 


Problem: 


A student takes a 32-question multiple-choice exam, but did not study 
and randomly guesses each answer. Each question has three possible 
choices for the answer. Find the probability that the student guesses 
more than 75% of the questions correctly. 


Solution: 


• X = number of questions answered correctly 

• X ~ B(32, 1/3) 

• We are interested in MORE THAN 75% of 32 questions correct. 
75% of 32 is 24. We want to find P(x > 24). The event "more than 
24" is the complement of "less than or equal to 24." 

• P(x > 24) ≈ 0 

• The probability of getting more than 75% of the 32 questions 
correct when randomly guessing is very small and practically 
zero. 


Exercise: 


Problem: 


Six different colored dice are rolled. Of interest is the number of dice 
that show a one. 


a. In words, define the random variable X. 

b. List the values that X may take on. 

c. On average, how many dice would you expect to show a one? 

d. Find the probability that all six dice show a one. 

e. Is it more likely that three or that four dice will show a one? Use 
numbers to justify your answer numerically. 


Exercise: 


Problem: 


More than 96 percent of the very largest colleges and universities 
(more than 15,000 total enrollments) have some online offerings. 
Suppose you randomly pick 13 such institutions. We are interested in 
the number that offer distance learning courses. 


a. In words, define the random variable X. 

b. List the values that X may take on. 

c. Give the distribution of X. X ~ _____(_____,_____) 

d. On average, how many schools would you expect to offer such 
courses? 


e. Find the probability that at most ten offer such courses. 

f. Is it more likely that 12 or that 13 will offer such courses? Use 
numbers to justify your answer numerically and answer in a 
complete sentence. 


Solution: 


a. X = the number of colleges and universities that have online 
offerings. 
b. 0, 1, 2, ..., 13 
c. X ~ B(13, 0.96) 
d. 12.48 
e. 0.0135 


f. P(x = 12) = 0.3186; P(x = 13) = 0.5882. It is more likely that 13 will 
offer such courses. 


Exercise: 


Problem: 


Suppose that about 85% of graduating students attend their graduation. 
A group of 22 graduating students is randomly chosen. 


a. In words, define the random variable X. 
b. List the values that X may take on. 
c. Give the distribution of X. X ~ _____(_____,_____) 


d. How many are expected to attend their graduation? 

e. Find the probability that 17 or 18 attend. 

f. Based on numerical values, would you be surprised if all 22 
attended graduation? Justify your answer numerically. 


Exercise: 


Problem: 


At The Fencing Center, 60% of the fencers use the foil as their main 
weapon. We randomly survey 25 fencers at The Fencing Center. We 
are interested in the number of fencers who do not use the foil as their 
main weapon. 


a. In words, define the random variable X. 

b. List the values that X may take on. 

c. Give the distribution of X. X ~ _____(_____,_____) 

d. How many are expected to not use the foil as their main 
weapon? 

e. Find the probability that six do not use the foil as their main 
weapon. 

f. Based on numerical values, would you be surprised if all 25 did 
not use foil as their main weapon? Justify your answer 
numerically. 


Solution: 


a. X = the number of fencers who do not use the foil as their main 
weapon 
b. 0, 1, 2, ..., 25 
c. X ~ B(25, 0.40) 
d. 10 

e. 0.0442 


f. The probability that all 25 do not use the foil is almost zero. 
Therefore, it would be very surprising. 


Exercise: 


Problem: 


Approximately 8% of students at a local high school participate in 
after-school sports all four years of high school. A group of 60 seniors 
is randomly chosen. Of interest is the number who participated in 
after-school sports all four years of high school. 


a. In words, define the random variable X. 

b. List the values that X may take on. 

c. Give the distribution of X. X ~ _____(_____,_____) 

d. How many seniors are expected to have participated in after- 
school sports all four years of high school? 

e. Based on numerical values, would you be surprised if none of the 
seniors participated in after-school sports all four years of high 
school? Justify your answer numerically. 

f. Based upon numerical values, is it more likely that four or that 
five of the seniors participated in after-school sports all four years 
of high school? Justify your answer numerically. 


Exercise: 


Problem: 


The chance of an IRS audit for a tax return with over $25,000 in 
income is about 2% per year. We are interested in the expected number 
of audits a person with that income has in a 20-year period. Assume 
each year is independent. 


a. In words, define the random variable X. 

b. List the values that X may take on. 

c. Give the distribution of X. X ~ _____(_____,_____) 

d. How many audits are expected in a 20-year period? 

e. Find the probability that a person is not audited at all. 

f. Find the probability that a person is audited more than twice. 




Solution: 


a. X = the number of audits in a 20-year period 
b. 0, 1, 2, ..., 20 

c. X ~ B(20, 0.02) 

d. 0.4 

e. 0.6676 

f. 0.0071 


Exercise: 


Problem: 


It has been estimated that only about 30% of California residents have 
adequate earthquake supplies. Suppose you randomly survey 11 
California residents. We are interested in the number who have 
adequate earthquake supplies. 


a. In words, define the random variable X. 

b. List the values that X may take on. 

c. Give the distribution of X. X ~ _____(_____,_____) 

d. What is the probability that at least eight have adequate 
earthquake supplies? 

e. Is it more likely that none or that all of the residents surveyed will 
have adequate earthquake supplies? Why? 

f. How many residents do you expect will have adequate earthquake 
supplies? 




Exercise: 


Problem: 


There are two similar games played for Chinese New Year and 
Vietnamese New Year. In the Chinese version, fair dice with numbers 
1, 2, 3, 4, 5, and 6 are used, along with a board with those numbers. In 
the Vietnamese version, fair dice with pictures of a gourd, fish, rooster, 
crab, crayfish, and deer are used. The board has those six objects on it, 
also. We will play with bets being $1. The player places a bet on a 
number or object. The “house” rolls three dice. If none of the dice 
show the number or object that was bet, the house keeps the $1 bet. If 
one of the dice shows the number or object bet (and the other two do 
not show it), the player gets back his or her $1 bet, plus $1 profit. If 
two of the dice show the number or object bet (and the third die does 
not show it), the player gets back his or her $1 bet, plus $2 profit. If all 
three dice show the number or object bet, the player gets back his or 
her $1 bet, plus $3 profit. Let X = number of matches and Y = profit 
per game. 


a. In words, define the random variable X. 

b. List the values that X may take on. 

c. List the values that Y may take on. Then, construct one PDF table 
that includes both X and Y and their probabilities. 

d. Calculate the average expected matches over the long run of 
playing this game for the player. 

e. Calculate the average expected earnings over the long run of 
playing this game for the player. 

f. Determine who has the advantage, the player or the house. 


Solution: 


a. X = the number of matches

b. 0, 1, 2, 3

c. In dollars: −1, 1, 2, 3

d. 1/2

e. The answer is −0.0787. You lose about eight cents, on average, per game.

f. The house has the advantage.
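
The expected values for this game can be checked with a short script. The following is a minimal sketch in Python, assuming X ~ B(3, 1/6), which follows from rolling three fair dice. The names `pmf` and `profit` are illustrative and do not come from the text.

```python
from fractions import Fraction
from math import comb

# Assumption: three fair dice, each matching the bet with probability 1/6,
# so X (the number of matches) is binomial with n = 3 and p = 1/6.
n = 3
p = Fraction(1, 6)

# P(X = k) for k = 0, 1, 2, 3, kept as exact fractions
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

# Profit Y in dollars for each number of matches, as stated in the problem
profit = {0: -1, 1: 1, 2: 2, 3: 3}

# Expected value = sum of (value * probability)
expected_matches = sum(k * prob for k, prob in pmf.items())
expected_profit = sum(profit[k] * prob for k, prob in pmf.items())

print("E(X) =", expected_matches)
print("E(Y) =", expected_profit, "or about", round(float(expected_profit), 4))
```

Exact fractions are used so that rounding does not hide the small negative expected profit that gives the house its advantage.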


Exercise: 


Problem: 


According to The World Bank, only 9% of the population of Uganda 
had access to electricity as of 2009. Suppose we randomly sample 150 
people in Uganda. Let X = the number of people who have access to 
electricity. 


a. What is the probability distribution for X? 

b. Using the formulas, calculate the mean and standard deviation of
X.

c. Find the probability that 15 people in the sample have access to 
electricity. 

d. Find the probability that at most ten people in the sample have 
access to electricity. 

e. Find the probability that more than 25 people in the sample have 
access to electricity. 


Exercise: 


Problem: 


The literacy rate for a nation measures the proportion of people age 15 
and over that can read and write. The literacy rate in Afghanistan is 
28.1%. Suppose you choose 15 people in Afghanistan at random. Let 
X = the number of people who are literate. 


a. Sketch a graph of the probability distribution of X. 

b. Using the formulas, calculate the (i) mean and (ii) standard 
deviation of X. 

c. Find the probability that more than five people in the sample are
literate. Is it more likely that three people or four people are
literate?


Solution: 


a. X ~ B(15, 0.281) 


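[Figure: graph of the probability distribution of X ~ B(15, 0.281), with x = 0 to 15 on the horizontal axis and probabilities from 0 to 0.25 on the vertical axis]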


b. i. Mean = μ = np = 15(0.281) = 4.215
ii. Standard Deviation = σ = √(npq) = √(15(0.281)(0.719)) =
1.7409


c. P(x > 5) = 1 − P(x ≤ 5) = 1 − 0.7754 = 0.2246
P(x = 3) = 0.1927
P(x = 4) = 0.2259
It is more likely that four people are literate than that three people are.
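
The values in this solution can be reproduced directly from the binomial formulas. This is a minimal sketch in Python using only the standard library; the helper name `binomial_pmf` is an assumption made for illustration.

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p): C(n, x) * p^x * q^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 15, 0.281          # 15 people, literacy rate 28.1%
q = 1 - p

mean = n * p              # mu = np
std_dev = sqrt(n * p * q) # sigma = sqrt(npq)

# P(x > 5) = 1 - P(x <= 5), adding the probabilities for x = 0, ..., 5
p_more_than_5 = 1 - sum(binomial_pmf(x, n, p) for x in range(6))

p_three = binomial_pmf(3, n, p)
p_four = binomial_pmf(4, n, p)

print(round(mean, 3), round(std_dev, 4))
print(round(p_more_than_5, 4), round(p_three, 4), round(p_four, 4))
```

Comparing `p_three` with `p_four` answers the "which is more likely" question without a table.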


Glossary 


Binomial Experiment 
a statistical experiment that satisfies the following three conditions:


1. There are a fixed number of trials, n. 

2. There are only two possible outcomes, called "success" and
"failure," for each trial. The letter p denotes the probability of a
success on one trial, and q denotes the probability of a failure on
one trial.

3. The n trials are independent and are repeated using identical 
conditions. 


Bernoulli Trials 
an experiment with the following characteristics: 


1. There are only two possible outcomes called “success” and 
“failure” for each trial. 

2. The probability p of a success is the same for any trial (so the 
probability q = 1 − p of a failure is the same for any trial).


Binomial Probability Distribution 
a discrete random variable (RV) that arises from Bernoulli trials; there 
are a fixed number, n, of independent trials. “Independent” means that 
the result of any trial (for example, trial one) does not affect the results 
of the following trials, and all trials are conducted under the same 
conditions. Under these circumstances the binomial RV X is defined as 
the number of successes in n trials. The mean is μ = np and the
standard deviation is σ = √(npq). The probability of exactly x successes
in n trials is


$$P(X = x) = \binom{n}{x} p^x q^{n-x}.$$


Poisson Distribution 


Another useful probability distribution is the Poisson distribution, or waiting time distribution.
This distribution is used to determine how many checkout clerks are needed to keep the waiting
time in line to specified levels, how many telephone lines are needed to keep the system from
overloading, and many other practical applications. A modification of the Poisson, the Pascal,
invented nearly four centuries ago, is used today by telecommunications companies worldwide
for load factors, satellite hookup levels, and Internet capacity problems. The distribution gets its
name from Siméon Poisson, who presented it in 1837 as an extension of the binomial distribution,
which, as we will see, can be estimated with the Poisson.


There are two main characteristics of a Poisson experiment.


1. The Poisson probability distribution gives the probability of a number of events occurring
in a fixed interval of time or space if these events happen with a known average rate.

2. The events occur independently of the time since the last event. For example, a book editor
might be interested in the number of words spelled incorrectly in a particular book. It might
be that, on average, there are five words spelled incorrectly in 100 pages. The interval is
the 100 pages, and it is assumed that there is no relationship between when misspellings
occur.

The random variable X = the number of occurrences in the interval of interest.


Example: 
Exercise: 


Problem: 


A bank expects to receive six bad checks per day, on average. What is the probability of the 
bank getting fewer than five bad checks on any given day? Of interest is the number of 
checks the bank receives in one day, so the time interval of interest is one day. Let X = the 
number of bad checks the bank receives in one day. If the bank expects to receive six bad 
checks per day then the average is six checks per day. Write a mathematical statement for 
the probability question. 


Solution: 


P(x < 5)


Example: 

You notice that a news reporter says "uh," on average, two times per broadcast. What is the
probability that the news reporter says "uh" more than two times per broadcast?

This is a Poisson problem because you are interested in knowing the number of times the news 
reporter says "uh" during a broadcast. 


Exercise: 


Problem: a. What is the interval of interest? 
Solution: 
a. one broadcast measured in minutes 
Exercise: 
Problem: 
b. What is the average number of times the news reporter says "uh" during one broadcast? 
Solution: 


b. 2


Exercise: 


Problem: c. Let X = ____________. What values does X take on?
Solution: 


c. Let X = the number of times the news reporter says "uh" during one broadcast. 
x = 0, 1, 2, 3, ...


Exercise: 


Problem: d. The probability question is P(______).


Solution: 


d. P(x > 2) 


Notation for the Poisson: P = Poisson Probability Distribution Function

X ~ P(μ)

Read this as "X is a random variable with a Poisson distribution." The parameter is μ (or λ); μ (or
λ) = the mean for the interval of interest. The mean is the number of occurrences that occur on
average during the interval period.


The formula for computing probabilities that are from a Poisson process is:


Equation:


$$P(x) = \frac{\mu^x e^{-\mu}}{x!}$$


where P(x) is the probability of x successes, μ is the expected number of successes based upon
historical data, e is the base of the natural logarithm, approximately equal to 2.718, and x is the
number of successes per unit, usually per unit of time.
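

The Poisson formula translates almost directly into code. The following is a minimal sketch in Python using only the standard library; the function names `poisson_pmf` and `poisson_cdf` are illustrative choices, not notation from the text. It is applied to the bank example above, where the mean is six bad checks per day and the question is P(x < 5).

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """P(X = x) for a Poisson random variable with mean mu."""
    return mu**x * exp(-mu) / factorial(x)

def poisson_cdf(x, mu):
    """P(X <= x), found by adding the probabilities for 0, 1, ..., x."""
    return sum(poisson_pmf(k, mu) for k in range(x + 1))

# Bank example: mu = 6 bad checks per day.
# "Fewer than five" means x = 0, 1, 2, 3, or 4, so P(x < 5) = P(x <= 4).
p_fewer_than_five = poisson_cdf(4, 6)
print(round(p_fewer_than_five, 4))
```

Because X counts occurrences, "fewer than five" becomes a cumulative probability up to four; that translation step is where most errors occur.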


In order to use the Poisson distribution, certain assumptions must hold. These are: the probability
of a success, μ, is unchanged within the interval; there cannot be simultaneous successes within
the interval; and the probability of a success among intervals is independent, the
same assumption as for the binomial distribution.


In a way, the Poisson distribution can be thought of as a clever way to convert a continuous 
random variable, usually time, into a discrete random variable by breaking up time into discrete 
independent intervals. This way of thinking about the Poisson helps us understand why it can be 
used to estimate the probability for the discrete random variable from the binomial distribution. 
The Poisson is asking for the probability of a number of successes during a period of time while 
the binomial is asking for the probability of a certain number of successes for a given number of 
trials. 


Example: 

Leah's answering machine receives about six telephone calls between 8 a.m. and 10 a.m. What 
is the probability that Leah receives more than one call in the next 15 minutes? 

Let X = the number of calls Leah receives in 15 minutes. (The interval of interest is 15 minutes
or 1/4 hour.)

x = 0, 1, 2, 3, ...

If Leah receives, on average, six telephone calls in two hours, and there are eight 15-minute
intervals in two hours, then Leah receives

(1/8)(6) = 0.75 calls in 15 minutes, on average. So, μ = 0.75 for this problem.

X ~ P(0.75)

Find P(x > 1). P(x > 1) = 0.1734

The probability that Leah receives more than one telephone call in the next 15 minutes is about
0.1734.
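
The key step in this example is rescaling the mean to the interval of interest before applying the formula. A minimal sketch in Python follows; the variable names are illustrative assumptions.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """P(X = x) for a Poisson random variable with mean mu."""
    return mu**x * exp(-mu) / factorial(x)

calls_in_two_hours = 6
intervals_in_two_hours = 8   # eight 15-minute intervals in two hours

# Rescale the mean from "per two hours" to "per 15 minutes"
mu = calls_in_two_hours / intervals_in_two_hours

# "More than one call" is the complement of "zero or one call"
p_more_than_one = 1 - (poisson_pmf(0, mu) + poisson_pmf(1, mu))

print(mu, round(p_more_than_one, 4))
```

Using the complement avoids summing the infinitely many terms for x = 2, 3, 4, and so on.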

[Figure: graph of X ~ P(0.75). The horizontal axis shows x = 0, 1, 2, 3, ...; the vertical axis shows P(X = x), the probability of x, where X = the number of calls in 15 minutes.]


Example: 

According to a survey a university professor gets, on average, 7 emails per day. Let X = the 
number of emails a professor receives per day. The discrete random variable X takes on the 
values x = 0, 1, 2 .... The random variable X has a Poisson distribution: X ~ P(7). The mean is 7 
emails. 

Exercise: 


Problem: 
a. What is the probability that an email user receives exactly 2 emails per day? 


b. What is the probability that an email user receives at most 2 emails per day? 
c. What is the standard deviation? 


Solution: 


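a. $P(x = 2) = \frac{\mu^x e^{-\mu}}{x!} = \frac{7^2 e^{-7}}{2!} \approx 0.022$


b. $P(x \le 2) = \frac{7^0 e^{-7}}{0!} + \frac{7^1 e^{-7}}{1!} + \frac{7^2 e^{-7}}{2!} \approx 0.030$
c. Standard Deviation = σ = √μ = √7 ≈ 2.65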


Example: 


Text message users receive or send an average of 41.5 text messages per day. 
Exercise: 


Problem: 
a. How many text messages does a text message user receive or send per hour? 


b. What is the probability that a text message user receives or sends two messages per 
hour? 


c. What is the probability that a text message user receives or sends more than two 
messages per hour? 


Solution: 


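a. Let X = the number of texts that a user sends or receives in one hour. The average
number of texts sent or received per hour is 41.5/24 ≈ 1.7292.


b. $P(x = 2) = \frac{\mu^x e^{-\mu}}{x!} = \frac{1.7292^2 e^{-1.7292}}{2!} \approx 0.265$


c. $P(x > 2) = 1 - P(x \le 2) = 1 - \left[\frac{1.7292^0 e^{-1.7292}}{0!} + \frac{1.7292^1 e^{-1.7292}}{1!} + \frac{1.7292^2 e^{-1.7292}}{2!}\right] \approx 0.250$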


Example: 
Exercise: 


Problem: 

On May 13, 2013, starting at 4:30 PM, the probability of low seismic activity for the next 
48 hours in Alaska was reported as about 1.02%. Use this information for the next 200 days 
to find the probability that there will be low seismic activity in ten of the next 200 days. 
Use both the binomial and Poisson distributions to calculate the probabilities. Are they 
close? 

Solution: 


Let X = the number of days with low seismic activity. 


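Using the binomial distribution:
$P(x = 10) = \frac{200!}{10!(200 - 10)!} \times 0.0102^{10} \times 0.9898^{190} \approx 0.000039$


Using the Poisson distribution:


Calculate μ = np = 200(0.0102) = 2.04
$P(x = 10) = \frac{\mu^x e^{-\mu}}{x!} = \frac{2.04^{10} e^{-2.04}}{10!} \approx 0.000045$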


We expect the approximation to be good because n is large (greater than 20) and p is small 
(less than 0.05). The results are close—both probabilities reported are almost 0. 


Estimating the Binomial Distribution with the Poisson Distribution 


We found before that the binomial distribution provided an approximation for the
hypergeometric distribution. Now we find that the Poisson distribution can provide an
approximation for the binomial. We say that the binomial distribution approaches the Poisson:
the binomial distribution approaches the Poisson distribution as n gets larger and p gets smaller
such that np remains a constant value. There are several rules of thumb for when one can
use a Poisson to estimate a binomial. One suggests that np, the mean of the binomial,
should be less than 25. Another author suggests that it should be less than 7. And another, noting
that the mean and variance of the Poisson are the same, suggests that np and npq, the mean
and variance of the binomial, should be greater than 5. There is no one broadly accepted rule of
thumb for when one can use the Poisson to estimate the binomial.
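

The limiting behavior described above can be seen numerically. The following is a minimal sketch in Python that holds np fixed at an illustrative value of 2 (an assumption, not a value from the text) and measures the largest gap between the binomial and Poisson probabilities as n grows. The helper `max_gap` is hypothetical.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, mu):
    """P(X = x) for a Poisson random variable with mean mu."""
    return mu**x * exp(-mu) / factorial(x)

def max_gap(n, mu, max_x=10):
    """Largest |binomial - Poisson| over x = 0, ..., max_x, with p = mu / n."""
    p = mu / n
    return max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, mu))
               for x in range(max_x + 1))

mu = 2  # illustrative constant value of np
gaps = {n: max_gap(n, mu) for n in (10, 100, 1000)}

for n, gap in gaps.items():
    print(f"n = {n:4d}, p = {mu / n:.3f}, largest gap = {gap:.5f}")
```

The gap shrinks as n increases and p decreases, which is the sense in which the binomial "approaches" the Poisson. The sketch does not settle which rule of thumb is best; it only illustrates the trend.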


As we move through these probability distributions we are getting to more sophisticated 
distributions that, in a sense, contain the less sophisticated distributions within them. This 
proposition has been proven by mathematicians. This gets us to the highest level of 
sophistication in the next probability distribution which can be used as an approximation to all of 
those that we have discussed so far. This is the normal distribution. 


Example: 

A survey of 500 seniors in the Price Business School yields the following information. 75% go 
straight to work after graduation. 15% go on to work on their MBA. 9% stay to get a minor in 
another program. 1% go on to get a Master's in Finance. 

Exercise: 


Problem: 


What is the probability that more than 2 seniors go to graduate school for their Master's in 
finance? 


Solution: 

This is clearly a binomial probability distribution problem. The choices are binary when we 
define the results as "Graduate School in Finance" versus "all other options." The random 
variable is discrete, and the events are, we could assume, independent. Solving as a
binomial problem, we have:


Binomial Solution


Equation:
$$n \cdot p = 500 \cdot 0.01 = 5 = \mu$$
Equation:
$$P(0) = \frac{500!}{0!(500 - 0)!} (0.01)^0 (1 - 0.01)^{500 - 0} = 0.00657$$
Equation:
$$P(1) = \frac{500!}{1!(500 - 1)!} (0.01)^1 (1 - 0.01)^{500 - 1} = 0.03318$$
Equation:
$$P(2) = \frac{500!}{2!(500 - 2)!} (0.01)^2 (1 - 0.01)^{500 - 2} = 0.08363$$


Adding all 3 together = 0.12339
Equation:
$$1 - 0.12339 = 0.87661$$


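Poisson approximation


Equation:
$$n \cdot p = 500 \cdot 0.01 = 5 = \mu$$
Equation:
$$n \cdot p \cdot (1 - p) = 500 \cdot 0.01 \cdot 0.99 = 4.95 \approx 5 = \sigma^2 = \mu$$
Equation:
$$P(x) = \frac{\mu^x e^{-\mu}}{x!}: \quad P(0) = \frac{5^0 e^{-5}}{0!}, \quad P(1) = \frac{5^1 e^{-5}}{1!}, \quad P(2) = \frac{5^2 e^{-5}}{2!}$$
Equation:
$$0.00674 + 0.03369 + 0.08422 = 0.12465$$
Equation:
$$1 - 0.12465 = 0.87535$$


An approximation that is off by about one thousandth is certainly an acceptable approximation.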
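
Both calculations in this example can be run side by side. This is a minimal sketch in Python, assuming n = 500 and p = 0.01 (1% of the 500 seniors); the variable names are illustrative.

```python
from math import comb, exp, factorial

n, p = 500, 0.01   # 500 seniors; 1% go on to a Master's in Finance
mu = n * p         # mean of the binomial, used as the Poisson mean

# P(x <= 2) under each model, adding the terms for x = 0, 1, 2
binomial_at_most_2 = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(3))
poisson_at_most_2 = sum(mu**x * exp(-mu) / factorial(x) for x in range(3))

# P(x > 2) is the complement in both cases
binomial_answer = 1 - binomial_at_most_2
poisson_answer = 1 - poisson_at_most_2

print(round(binomial_answer, 5), round(poisson_answer, 5))
print("difference:", round(abs(binomial_answer - poisson_answer), 5))
```

The two answers differ only in the third decimal place, which is why the Poisson is a reasonable substitute here.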
Chapter Review


A Poisson probability distribution of a discrete random variable gives the probability of a 
number of events occurring in a fixed interval of time or space, if these events happen at a known 
average rate and independently of the time since the last event. The Poisson distribution may be 
used to approximate the binomial, if the probability of success is "small" (less than or equal to 
0.01) and the number of trials is "large" (greater than or equal to 25). Other rules of thumb are 
also suggested by different authors, but all recognize that the Poisson distribution is the limiting 
distribution of the binomial as n increases and p approaches zero. 


The formula for computing probabilities that are from a Poisson process is:
Equation:


$$P(x) = \frac{\mu^x e^{-\mu}}{x!}$$


where P(x) is the probability of x successes, μ (pronounced mu) is the expected number of
successes, e is the base of the natural logarithm, approximately equal to 2.718, and x is the
number of successes per unit, usually per unit of time.


Formula Review 


X ~ P(μ) means that X has a Poisson probability distribution where X = the number of
occurrences in the interval of interest.


X takes on the values x = 0, 1, 2, 3, ...
The mean μ (or λ) is typically given.


The variance is σ² = μ, and the standard deviation is σ = √μ.


When P(μ) is used to approximate a binomial distribution, μ = np where n represents the number
of independent trials and p represents the probability of success in a single trial.
Equation:


$$P(x) = \frac{\mu^x e^{-\mu}}{x!}$$


Use the following information to answer the next six exercises: On average, a clothing store gets 
120 customers per day. 
Exercise: 


Problem: 

Assume the event occurs independently in any given day. Define the random variable X. 
Exercise: 

Problem: What values does X take on? 


Solution: 
0, 1, 2, 3, 4, ...


Exercise: 


Problem: What is the probability of getting 150 customers in one day? 


Exercise: 


Problem: 


What is the probability of getting 35 customers in the first four hours? Assume the store is 
open 12 hours each day. 


Solution: 


0.0485 
Exercise: 


Problem: 


What is the probability that the store will have more than 12 customers in the first hour? 
Exercise: 
Problem: 


What is the probability that the store will have fewer than 12 customers in the first two 
hours? 


Solution: 


0.0214 
Exercise: 


Problem: 


Which type of distribution can the Poisson model be used to approximate? When would you 
do this? 


Use the following information to answer the next six exercises: On average, eight teens in the 
U.S. die from motor vehicle injuries per day. As a result, states across the country are debating 
raising the driving age. 

Exercise: 


Problem: 


Assume the event occurs independently in any given day. In words, define the random 
variable X. 


Solution: 


X = the number of U.S. teens who die from motor vehicle injuries per day. 


Exercise: 


Problem: X ~ _____(_____, _____)


Exercise: 


Problem: What values does X take on? 


Solution: 


0, 1, 2, 3, 4, ...
Exercise: 


Problem: 


For the given values of the random variable X, fill in the corresponding probabilities. 
Exercise: 
Problem: 


Is it likely that there will be no teens killed from motor vehicle injuries on any given day in 
the U.S? Justify your answer numerically. 


Solution: 


No 
Exercise: 
Problem: 


Is it likely that there will be more than 20 teens killed from motor vehicle injuries on any 
given day in the U.S.? Justify your answer numerically. 


HOMEWORK 


Exercise: 


Problem: 


The switchboard in a Minneapolis law office gets an average of 5.5 incoming phone calls 
during the noon hour on Mondays. Experience shows that the existing staff can handle up to 
six calls in an hour. Let X = the number of calls received at noon. 


a. Find the mean and standard deviation of X. 

b. What is the probability that the office receives at most six calls at noon on Monday? 

c. Find the probability that the law office receives six calls at noon. What does this mean 
to the law office staff who get, on average, 5.5 incoming phone calls at noon? 

d. What is the probability that the office receives more than eight calls at noon? 


Solution: 


a. X ~ P(5.5); μ = 5.5; σ = √5.5 ≈ 2.3452

b. P(x ≤ 6) ≈ 0.6860

c. P(x = 6) ≈ 0.1571. There is about a 15.7% probability that the law office receives exactly
six calls, the most the existing staff can handle.

d. P(x > 8) = 1 − P(x ≤ 8) ≈ 1 − 0.8944 = 0.1056


Exercise: 
Problem: 
The maternity ward at Dr. Jose Fabella Memorial Hospital in Manila in the Philippines is
one of the busiest in the world with an average of 60 births per day. Let X = the number of
births in an hour.


a. Find the mean and standard deviation of X. 

b. Sketch a graph of the probability distribution of X. 

c. What is the probability that the maternity ward will deliver three babies in one hour? 

d. What is the probability that the maternity ward will deliver at most three babies in one 
hour? 

e. What is the probability that the maternity ward will deliver more than five babies in 
one hour? 


Exercise: 


Problem: 


A manufacturer of Christmas tree light bulbs knows that 3% of its bulbs are defective. Find 
the probability that a string of 100 lights contains at most four defective bulbs using both 
the binomial and Poisson distributions. 


Solution: 
Let X = the number of defective bulbs in a string. 
Using the Poisson distribution: 

• μ = np = 100(0.03) = 3

• X ~ P(3)

• P(x ≤ 4) ≈ 0.8153


Using the binomial distribution:


• X ~ B(100, 0.03)
• P(x ≤ 4) ≈ 0.8179


The Poisson approximation is very good—the difference between the probabilities is only 
0.0026. 


Exercise: 


Problem: 


The average number of children a Japanese woman has in her lifetime is 1.37. Suppose that 
one Japanese woman is randomly chosen. 


a. In words, define the random variable X. 

b. List the values that X may take on. 

c. Find the probability that she has no children. 

d. Find the probability that she has fewer children than the Japanese average. 
e. Find the probability that she has more children than the Japanese average. 


Exercise: 


Problem: 


The average number of children a Spanish woman has in her lifetime is 1.47. Suppose that 
one Spanish woman is randomly chosen. 


a. In words, define the random variable X.

b. List the values that X may take on. 

c. Find the probability that she has no children. 

d. Find the probability that she has fewer children than the Spanish average. 
e. Find the probability that she has more children than the Spanish average.


Solution: 


a. X = the number of children for a Spanish woman 
b. 0, 1, 2, 3, ...

c. 0.2299

d. 0.5679 

e. 0.4321 


Exercise: 


Problem: 


Fertile, female cats produce an average of three litters per year. Suppose that one fertile, 
female cat is randomly chosen. In one year, find the probability she produces: 


a. In words, define the random variable X. 

b. List the values that X may take on. 

c. Give the distribution of X. X ~ 

d. Find the probability that she has no litters in one year. 

e. Find the probability that she has at least two litters in one year. 
f. Find the probability that she has exactly three litters in one year. 


Exercise: 


Problem: 


The chance of having an extra fortune in a fortune cookie is about 3%. Given a bag of 144 
fortune cookies, we are interested in the number of cookies with an extra fortune. Two 
distributions may be used to solve this problem, but only use one distribution to solve the 
problem. 


a. In words, define the random variable X. 

b. List the values that X may take on. 

c. How many cookies do we expect to have an extra fortune? 

d. Find the probability that none of the cookies have an extra fortune. 

e. Find the probability that more than three have an extra fortune. 

f. As n increases, what happens involving the probabilities using the two distributions? 
Explain in complete sentences. 


Solution: 


a. X = the number of fortune cookies that have an extra fortune 
b. 0, 1, 2, 3,... 144 

c. 4.32 

d. 0.0124 or 0.0133 

e. 0.6300 or 0.6264 

f. As n gets larger, the probabilities get closer together. 


Exercise: 


Problem: 


According to the South Carolina Department of Mental Health web site, for every 200 U.S. 
women, the average number who suffer from anorexia is one. Out of a randomly chosen 
group of 600 U.S. women determine the following. 


a. In words, define the random variable X. 

b. List the values that X may take on. 

c. Give the distribution of X. X ~ _____(_____, _____)

d. How many are expected to suffer from anorexia? 

e. Find the probability that no one suffers from anorexia. 

f. Find the probability that more than four suffer from anorexia. 


Exercise: 


Problem: 


The chance of an IRS audit for a tax return with over $25,000 in income is about 2% per 
year. Suppose that 100 people with tax returns over $25,000 are randomly picked. We are 
interested in the number of people audited in one year. Use a Poisson distribution to answer
the following questions.


a. In words, define the random variable X. 

b. List the values that X may take on. 

c. How many are expected to be audited? 

d. Find the probability that no one was audited. 

e. Find the probability that at least three were audited. 


Solution: 


a. X = the number of people audited in one year 
b. 0, 1, 2, ..., 100

c. 2

d. 0.1353

e. 0.3233


Exercise: 


Problem: 


Approximately 8% of students at a local high school participate in after-school sports all 
four years of high school. A group of 60 seniors is randomly chosen. Of interest is the 
number that participated in after-school sports all four years of high school. 


a. In words, define the random variable X. 

b. List the values that X may take on. 

c. How many seniors are expected to have participated in after-school sports all four 
years of high school? 

d. Based on numerical values, would you be surprised if none of the seniors participated 
in after-school sports all four years of high school? Justify your answer numerically. 

e. Based on numerical values, is it more likely that four or that five of the seniors 
participated in after-school sports all four years of high school? Justify your answer 
numerically. 


Exercise: 


Problem: 


On average, Pierre, an amateur chef, drops three pieces of egg shell into every two cake 
batters he makes. Suppose that you buy one of his cakes. 


a. In words, define the random variable X. 

b. List the values that X may take on. 

c. On average, how many pieces of egg shell do you expect to be in the cake? 

d. What is the probability that there will not be any pieces of egg shell in the cake? 

e. Let’s say that you buy one of Pierre’s cakes each week for six weeks. What is the 
probability that there will not be any egg shell in any of the cakes? 

f. Based upon the average given for Pierre, is it possible for there to be seven pieces of 
shell in the cake? Why? 


Solution: 


a. X = the number of shell pieces in one cake 
b. 0, 1, 2, 3, ...

c. 1.5

d. 0.2231

e. 0.0001 

f. Yes 


Use the following information to answer the next two exercises: The average number of times per 
week that Mrs. Plum’s cats wake her up at night because they want to play is ten. We are 
interested in the number of times her cats wake her up each week. 

Exercise: 


Problem: In words, the random variable X = 


a. the number of times Mrs. Plum’s cats wake her up each week. 
b. the number of times Mrs. Plum’s cats wake her up each hour. 

c. the number of times Mrs. Plum’s cats wake her up each night. 
d. the number of times Mrs. Plum’s cats wake her up. 


Exercise: 


Problem: 
Find the probability that her cats will wake her up no more than five times next week. 


a. 0.5000 
b. 0.9329
c. 0.0378 
d. 0.0671 


Solution: 


d 


Glossary 


Poisson Probability Distribution 
a discrete random variable (RV) that counts the number of times a certain event will occur 
in a specific interval; characteristics of the variable: 


• The probability that the event occurs in a given interval is the same for all intervals.


• The events occur with a known mean and independently of the time since the last
event.


The distribution is defined by the mean μ of the event in the interval. When the Poisson is
used to approximate a binomial, the mean is μ = np. The standard deviation is σ = √μ. The
probability of having exactly x successes in the interval is $P(x) = \frac{\mu^x e^{-\mu}}{x!}$.
The Poisson distribution is often used to approximate the binomial
distribution when n is “large” and p is “small” (a general rule is that n should be greater
than or equal to 25 and p should be less than or equal to 0.01).


Introduction 
[Figure: If you ask enough people about their shoe size, you will find that your graphed data is shaped like a bell curve and can be described as normally distributed. (credit: Omer Unli)]


The normal probability density function, a continuous distribution, is the 
most important of all the distributions. It is widely used and even more 
widely abused. Its graph is bell-shaped. You see the bell curve in almost all 
disciplines. Some of these include psychology, business, economics, the 
sciences, nursing, and, of course, mathematics. Some of your instructors 
may use the normal distribution to help determine your grade. Most IQ 
scores are normally distributed. Often real-estate prices fit a normal 
distribution. 


The normal distribution is extremely important, but it cannot be applied to 
everything in the real world. Remember here that we are still talking about 
the distribution of population data. This is a discussion of probability and 
thus it is the population data that may be normally distributed, and if it is, 
then this is how we can find probabilities of specific events just as we did 
for population data that may be binomially distributed or Poisson 
distributed. This caution is here because in the next chapter we will see that 
the normal distribution describes something very different from raw data 
and forms the foundation of inferential statistics. 


The normal distribution has two parameters (two numerical descriptive
measures): the mean (μ) and the standard deviation (σ). If X is a quantity to
be measured that has a normal distribution with mean (μ) and standard
deviation (σ), we designate this by writing the following notation for the
normal probability density function:
NORMAL: X ~ N(μ, σ)


The probability density function is a rather complicated function. Do not 
memorize it. It is not necessary. 
Equation:

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$


The curve is symmetric about a vertical line drawn through the mean, μ.
The mean is the same as the median, which is the same as the mode,
because the graph is symmetric about μ. As the notation indicates, the
normal distribution depends only on the mean and the standard deviation.
Note that this is unlike several probability density functions we have
already studied, such as the Poisson, where the mean is equal to μ and the
standard deviation is simply the square root of the mean, or the binomial,
where p is used to determine both the mean and standard deviation. Since
the area under the curve must equal one, a change in the standard deviation,
σ, causes a change in the shape of the normal curve; the curve becomes
fatter and wider or skinnier and taller depending on σ. A change in μ causes
the graph to shift to the left or right. This means there are an infinite
number of normal probability distributions. One of special interest is called
the standard normal distribution.
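

The properties described above (symmetry about the mean, total area of one, and a shape controlled by the standard deviation) can be checked numerically. The following is a minimal sketch in Python; the function names `normal_pdf` and `area_under_curve` are illustrative, and the area is approximated with a simple trapezoid rule over eight standard deviations on each side of the mean, which is an assumption chosen for convenience.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Normal probability density function with mean mu and std dev sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def area_under_curve(mu, sigma, width=8.0, steps=10_000):
    """Trapezoid-rule area from mu - width*sigma to mu + width*sigma."""
    lo = mu - width * sigma
    hi = mu + width * sigma
    h = (hi - lo) / steps
    total = 0.5 * (normal_pdf(lo, mu, sigma) + normal_pdf(hi, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(lo + i * h, mu, sigma)
    return total * h

# Area is (numerically) one regardless of the parameters
print(round(area_under_curve(0, 1), 6), round(area_under_curve(5, 6), 6))

# Symmetry: equal heights at equal distances on either side of the mean
print(normal_pdf(3, 5, 6), normal_pdf(7, 5, 6))

# A smaller sigma gives a taller, skinnier curve at the mean
print(normal_pdf(0, 0, 1), normal_pdf(0, 0, 2))
```

Changing `mu` only shifts the curve, while changing `sigma` trades height for width so that the area stays at one.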


Formula Review 
X ~ N(μ, σ)


μ = the mean; σ = the standard deviation


Glossary 


Normal Distribution 
a continuous random variable (RV) with pdf
$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$,
where μ is the mean of the distribution and σ is the standard
deviation; notation: X ~ N(μ, σ). If μ = 0 and σ = 1, the RV, Z, is called
the standard normal distribution.


The Standard Normal Distribution 


The standard normal distribution is a normal distribution of 
standardized values called z-scores. A z-score is measured in units of 
the standard deviation. 


The mean for the standard normal distribution is zero, and the standard 
deviation is one. What this does is dramatically simplify the mathematical 
calculation of probabilities. Take a moment and substitute zero and one in 
the appropriate places in the above formula and you can see that the 
equation collapses into one that can be much more easily solved using 
integral calculus. The transformation z = (x − μ)/σ produces the distribution Z ~
N(0, 1). The value x in the given equation comes from a known normal
distribution with known mean μ and known standard deviation σ. The z-score
tells how many standard deviations a particular x is away from the
mean.


Z-Scores 


If X is a normally distributed random variable and X ~ N(μ, σ), then the z-score
for a particular x is:
Equation:

$$z = \frac{x - \mu}{\sigma}$$

The z-score tells you how many standard deviations the value x is above
(to the right of) or below (to the left of) the mean, μ. Values of x that are
larger than the mean have positive z-scores, and values of x that are smaller
than the mean have negative z-scores. If x equals the mean, then x has a
z-score of zero.


Example: 
Suppose X ~ N(5, 6). This says that X is a normally distributed random
variable with mean μ = 5 and standard deviation σ = 6. Suppose x = 17.


Then:
Equation:
$$z = \frac{x - \mu}{\sigma} = \frac{17 - 5}{6} = 2$$


This means that x = 17 is two standard deviations (2σ) above or to the
right of the mean μ = 5.
Now suppose x = 1. Then: $z = \frac{x - \mu}{\sigma} = \frac{1 - 5}{6} = -0.67$ (rounded to two decimal
places)
This means that x = 1 is 0.67 standard deviations (−0.67σ) below or to
the left of the mean μ = 5.
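

The z-score calculation in this example is a one-line function. A minimal sketch in Python follows; the function name `z_score` is an illustrative choice.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) mu."""
    if sigma <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mu) / sigma

# X ~ N(5, 6): mean 5, standard deviation 6
print(z_score(17, 5, 6))            # above the mean, so positive
print(round(z_score(1, 5, 6), 2))   # below the mean, so negative
print(z_score(5, 5, 6))             # at the mean, so zero
```

The sign of the result shows on which side of the mean the value falls, and the size shows how far away it is in units of σ.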


The Empirical Rule 
If X is a random variable and has a normal distribution with mean μ and
standard deviation σ, then the Empirical Rule states the following:


• About 68% of the x values lie between −1σ and +1σ of the mean μ
(within one standard deviation of the mean).

• About 95% of the x values lie between −2σ and +2σ of the mean μ
(within two standard deviations of the mean).

• About 99.7% of the x values lie between −3σ and +3σ of the mean μ
(within three standard deviations of the mean). Notice that almost all
the x values lie within three standard deviations of the mean.

• The z-scores for +1σ and −1σ are +1 and −1, respectively.

• The z-scores for +2σ and −2σ are +2 and −2, respectively.

• The z-scores for +3σ and −3σ are +3 and −3, respectively.


Example: 
Suppose x has a normal distribution with mean 50 and standard deviation 6. 


• About 68% of the x values lie within one standard deviation of the 
mean. Therefore, about 68% of the x values lie between −1σ = (−1)(6) 
= −6 and 1σ = (1)(6) = 6 of the mean 50. The values 50 − 6 = 44 and 
50 + 6 = 56 are within one standard deviation from the mean 50. The 
z-scores are −1 and +1 for 44 and 56, respectively. 

• About 95% of the x values lie within two standard deviations of the 
mean. Therefore, about 95% of the x values lie between −2σ = (−2)(6) 
= −12 and 2σ = (2)(6) = 12 of the mean 50. The values 50 − 12 = 38 and 
50 + 12 = 62 are within two standard deviations from the mean 50. The 
z-scores are −2 and +2 for 38 and 62, respectively. 

• About 99.7% of the x values lie within three standard deviations of 
the mean. Therefore, about 99.7% of the x values lie between −3σ = 
(−3)(6) = −18 and 3σ = (3)(6) = 18 of the mean 50. The values 50 − 18 = 
32 and 50 + 18 = 68 are within three standard deviations from the 
mean 50. The z-scores are −3 and +3 for 32 and 68, respectively. 
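The three intervals in this example, and the percentages attached to them, can be reproduced with Python's standard library. The sketch below assumes Python 3.8 or later (for `statistics.NormalDist`); the helper name `band` is ours, chosen only for illustration.

```python
from statistics import NormalDist

MU, SIGMA = 50, 6                 # mean and standard deviation from the example
X = NormalDist(MU, SIGMA)

def band(k):
    """Return (low, high, probability) for the interval MU ± k·SIGMA."""
    low, high = MU - k * SIGMA, MU + k * SIGMA
    probability = X.cdf(high) - X.cdf(low)
    return low, high, probability

for k in (1, 2, 3):
    low, high, probability = band(k)
    print(f"within {k} standard deviation(s): {low} to {high}, "
          f"probability about {probability:.4f}")
```

The computed probabilities are slightly more precise than the rounded 68%, 95%, and 99.7% figures of the Empirical Rule, which is why some exercises quote values such as 68.27% and 95.45%.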
Chapter Review 


A z-score is a standardized value. Its distribution is the standard normal, Z ~ 
N(0, 1). The mean of the z-scores is zero and the standard deviation is one. 


If z is the z-score for a value x from the normal distribution N(μ, σ), then z 
tells you how many standard deviations x is above (greater than) or below 
(less than) μ. 


Formula Review 

Z ~ N(0, 1) 

z = a standardized value (z-score) 
mean = 0; standard deviation = 1 


To find the kth percentile of X when the z-score is known: 
k = μ + (z)σ 


z-score: z = (x − μ)/σ or z = |x − μ|/σ 


Z = the random variable for z-scores 
Exercise: 
Problem: 


A bottle of water contains 12.05 fluid ounces with a standard deviation 
of 0.01 ounces. Define the random variable X in words. X = 


Solution: 


ounces of water in a bottle 
Exercise: 
Problem: 
A normal distribution has a mean of 61 and a standard deviation of 15. 
What is the median? 


Exercise: 


Problem: X ~ N(1, 2) 
σ = 
Solution: 


2 
Exercise: 


Problem: 


A company manufactures rubber balls. The mean diameter of a ball is 
12 cm with a standard deviation of 0.2 cm. Define the random variable 


X in words. X = 


Exercise: 
Problem: X ~ N(-4, 1) 
What is the median? 


Solution: 


−4 


Exercise: 


Problem: X ~ N(3, 5) 


_ 


Exercise: 
Problem: X ~ N(−2, 1) 
μ = 


Solution: 


−2 


Exercise: 


Problem: What does a z-score measure? 
Exercise: 


Problem: 


What does standardizing a normal distribution do to the mean? 


Solution: 


The mean becomes zero. 
Exercise: 


Problem: 


Is X ~ N(0, 1) a standardized normal distribution? Why or why not? 
Exercise: 


Problem: 


What is the z-score of x = 12, if it is two standard deviations to the 
right of the mean? 


Solution: 
z = 2 
Exercise: 


Problem: 


What is the z-score of x = 9, if it is 1.5 standard deviations to the left of 
the mean? 


Exercise: 


Problem: 


What is the z-score of x = −2, if it is 2.78 standard deviations to the 
right of the mean? 


Solution: 


z = 2.78 
Exercise: 


Problem: 


What is the z-score of x = 7, if it is 0.133 standard deviations to the left 
of the mean? 


Exercise: 


Problem: Suppose X ~ N(2, 6). What value of x has a z-score of three? 


Solution: 


x= 20 
Exercise: 


Problem: 
Suppose X ~ N(8, 1). What value of x has a z-score of −2.25? 


Exercise: 


Problem: Suppose X ~ N(9, 5). What value of x has a z-score of −0.5? 


Solution: 


x=6.5 


Exercise: 


Problem: 


Suppose X ~ N(2, 3). What value of x has a z-score of −0.67? 
Exercise: 


Problem: 


Suppose X ~ N(4, 2). What value of x is 1.5 standard deviations to the 
left of the mean? 


Solution: 


x=1 
Exercise: 


Problem: 


Suppose X ~ N(4, 2). What value of x is two standard deviations to the 
right of the mean? 


Exercise: 


Problem: 


Suppose X ~ N(8, 9). What value of x is 0.67 standard deviations to the 
left of the mean? 


Solution: 


x= 1.97 


Exercise: 


Problem: Suppose X ~ N(-1, 2). What is the z-score of x = 2? 


Exercise: 


Problem: Suppose X ~ N(12, 6). What is the z-score of x = 2? 


Solution: 


z = −1.67 


Exercise: 


Problem: Suppose X ~ N(9, 3). What is the z-score of x = 9? 
Exercise: 


Problem: 


Suppose a normal distribution has a mean of six and a standard 
deviation of 1.5. What is the z-score of x = 5.5? 


Solution: 


z ≈ −0.33 
Exercise: 
Problem: 
In a normal distribution, x = 5 and z = −1.25. This tells you that x = 5 is 
____ standard deviations to the ____ (right or left) of the mean. 
Exercise: 
Problem: 


In a normal distribution, x = 3 and z = 0.67. This tells you that x = 3 is 
____ standard deviations to the ____ (right or left) of the mean. 


Solution: 


0.67, right 


Exercise: 


Problem: 
In a normal distribution, x = −2 and z = 6. This tells you that x = −2 is 
____ standard deviations to the ____ (right or left) of the mean. 
Exercise: 
Problem: 


In a normal distribution, x = −5 and z = −3.14. This tells you that x = −5 
is ____ standard deviations to the ____ (right or left) of the mean. 


Solution: 


3.14, left 
Exercise: 
Problem: 
In a normal distribution, x = 6 and z = −1.7. This tells you that x = 6 is 
____ standard deviations to the ____ (right or left) of the mean. 
Exercise: 
Problem: 


About what percent of x values from a normal distribution lie within 
one standard deviation (left and right) of the mean of that distribution? 


Solution: 


about 68% 
Exercise: 
Problem: 
About what percent of the x values from a normal distribution lie 


within two standard deviations (left and right) of the mean of that 
distribution? 


Exercise: 


Problem: 


About what percent of x values lie between the second and third 
standard deviations (both sides)? 


Solution: 


about 4% 
Exercise: 
Problem: 
Suppose X ~ N(15, 3). Between what x values does 68.27% of the data 


lie? The range of x values is centered at the mean of the distribution 
(i.e., 15). 


Exercise: 
Problem: 
Suppose X ~ N(−3, 1). Between what x values does 95.45% of the data 
lie? The range of x values is centered at the mean of the 
distribution (i.e., −3). 


Solution: 


between −5 and −1 
Exercise: 
Problem: 
Suppose X ~ N(−3, 1). Between what x values does 34.14% of the data 
lie? 
Exercise: 
Problem: 


About what percent of x values lie between the mean and three 
standard deviations? 


Solution: 


about 50% 
Exercise: 
Problem: 
About what percent of x values lie between the mean and one standard 
deviation? 
Exercise: 
Problem: 


About what percent of x values lie between the first and second 
standard deviations from the mean (both sides)? 


Solution: 


about 27% 
Exercise: 
Problem: 


About what percent of x values lie between the first and third 
standard deviations (both sides)? 


Use the following information to answer the next two exercises: The life of 
Sunshine CD players is normally distributed with a mean of 4.1 years and a 
standard deviation of 1.3 years. A CD player is guaranteed for three years. 
We are interested in the length of time a CD player lasts. 

Exercise: 


Problem: 
Define the random variable X in words. X = 


Solution: 


The lifetime of a Sunshine CD player measured in years. 


Exercise: 


Problem: X ~ _____(_____, _____) 


Homework 


Use the following information to answer the next two exercises: The patient 
recovery time from a particular surgical procedure is normally distributed 
with a mean of 5.3 days and a standard deviation of 2.1 days. 

Exercise: 


Problem: What is the median recovery time? 


a. 2.7 
b. 5.3 
c. 7.4 
d. 2.1 


Exercise: 


Problem: 
What is the z-score for a patient who takes ten days to recover? 


a. 1.5 
b. 0.2 
c. 2.2 
d. 7.3 


Solution: 


c 


Exercise: 


Problem: 


The length of time it takes to find a parking space at 9 A.M. 
follows a normal distribution with a mean of five minutes and a 
standard deviation of two minutes. If the mean is significantly greater 
than the standard deviation, which of the following statements is true? 


I. The data cannot follow the uniform distribution. 
II. The data cannot follow the exponential distribution. 
III. The data cannot follow the normal distribution. 


a. I only 

b. II only 

c. III only 

d. I, II, and III 


Exercise: 


Problem: 


The heights of the 430 National Basketball Association players were 
listed on team rosters at the start of the 2005-2006 season. The heights 
of basketball players have an approximate normal distribution with 
mean, μ = 79 inches and a standard deviation, σ = 3.89 inches. For 
each of the following heights, calculate the z-score and interpret it 
using complete sentences. 


a. 77 inches 

b. 85 inches 

c. If an NBA player reported his height had a z-score of 3.5, would 
you believe him? Explain your answer. 


Solution: 


a. Use the z-score formula. z = −0.5141. The height of 77 inches is 
0.5141 standard deviations below the mean. An NBA player 
whose height is 77 inches is shorter than average. 


b. Use the z-score formula. z = 1.5424. The height 85 inches is 
1.5424 standard deviations above the mean. An NBA player 
whose height is 85 inches is taller than average. 

c. Height = 79 + 3.5(3.89) = 92.615 inches, which is taller than 7 
feet, 8 inches. There are very few NBA players this tall so the 
answer is no, not likely. 


Exercise: 


Problem: 


The systolic blood pressure (given in millimeters) of males has an 
approximately normal distribution with mean μ = 125 and standard 
deviation σ = 14. 


a. Calculate the z-scores for the male systolic blood pressures 100 
and 150 millimeters. 

b. If a male friend of yours said he thought his systolic blood 
pressure was 2.5 standard deviations below the mean, but that he 
believed his blood pressure was between 100 and 150 
millimeters, what would you say to him? 


Exercise: 


Problem: 


Kyle’s doctor told him that the z-score for his systolic blood pressure is 
1.75. Which of the following is the best interpretation of this 
standardized score? The systolic blood pressure (given in millimeters) 
of males has an approximately normal distribution with mean μ = 125 
and standard deviation σ = 14. If X = a systolic blood pressure score 
then X ~ N(125, 14). 


a. Which answer(s) is/are correct? 


i. Kyle’s systolic blood pressure is 175. 


ii. Kyle’s systolic blood pressure is 1.75 times the average 
blood pressure of men his age. 

iii. Kyle’s systolic blood pressure is 1.75 above the average 
systolic blood pressure of men his age. 

iv. Kyle’s systolic blood pressure is 1.75 standard deviations 
above the average systolic blood pressure for men. 


b. Calculate Kyle’s blood pressure. 


Solution: 


a. iv 
b. Kyle’s blood pressure is equal to 125 + (1.75)(14) = 149.5. 


Exercise: 


Problem: 


Height and weight are two measurements used to track a child’s 
development. The World Health Organization measures child 
development by comparing the weights of children who are the same 
height and the same gender. In 2009, weights for all 80 cm girls in the 
reference population had a mean μ = 10.2 kg and standard deviation σ 
= 0.8 kg. Weights are normally distributed. X ~ N(10.2, 0.8). Calculate 
the z-scores that correspond to the following weights and interpret 
them. 


a. 11 kg 
b. 7.9 kg 
c. 12.2 kg 


Exercise: 


Problem: 


In 2005, 1,475,623 students heading to college took the SAT. The 
distribution of scores in the math section of the SAT follows a normal 
distribution with mean μ = 520 and standard deviation σ = 115. 


a. Calculate the z-score for an SAT score of 720. Interpret it using a 
complete sentence. 

b. What math SAT score is 1.5 standard deviations above the mean? 
What can you say about this SAT score? 

c. For 2012, the SAT math test had a mean of 514 and standard 
deviation 117. The ACT math test is an alternate to the SAT and 
is approximately normally distributed with mean 21 and standard 
deviation 5.3. If one person took the SAT math test and scored 
700 and a second person took the ACT math test and scored 30, 
who did better with respect to the test they took? 


Solution: 
Let X = an SAT math score and Y = an ACT math score. 


a. x = 720; z = (720 − 520)/115 = 1.74. The exam score of 720 is 1.74 
standard deviations above the mean of 520. 

b. z = 1.5 
The math SAT score is 520 + 1.5(115) ≈ 692.5. The exam score of 
692.5 is 1.5 standard deviations above the mean of 520. 

c. (x − μ)/σ = (700 − 514)/117 ≈ 1.59, the z-score for the SAT. 
(y − μ)/σ = (30 − 21)/5.3 ≈ 1.70, the z-score for the ACT. With respect 
to the test they took, the person who took the ACT did better (has the 
higher z-score). 


Glossary 


Standard Normal Distribution 


a continuous random variable (RV) X ~ N(0, 1); when X follows the 
standard normal distribution, it is often noted as Z ~ N(0, 1). 


z-score 
the linear transformation of the form z = (x − μ)/σ or written as 
z = |x − μ|/σ; 
if this transformation is applied to any normal distribution X ~ N(μ, σ) 
the result is the standard normal distribution Z ~ N(0, 1). If this 
transformation is applied to any specific value x of the RV with mean μ 
and standard deviation σ, the result is called the z-score of x. The 
z-score allows us to compare data that are normally distributed but 
scaled differently. A z-score is the number of standard deviations a 
particular x is away from its mean value. 


Using the Normal Distribution 


The shaded area in the following graph indicates the area to the right of x. 
This area is represented by the probability P(X > x). Normal tables provide 
the probability between the mean, zero for the standard normal distribution, 
and a specific value such as x₁. This is the unshaded part of the graph from 
the mean to x₁. 

[Figure: normal curve; the shaded area in the right tail represents the probability P(X ≥ x₁).] 


Because the normal distribution is symmetrical, if x₁ were the same 
distance to the left of the mean, the area (probability) in the left tail 
would be the same as the shaded area in the right tail. Also, bear in mind 
that because of the symmetry of this distribution, one-half of the 
probability is to the right of the mean and one-half is to the left of the 
mean. 


Calculations of Probabilities 


To find the probability for probability density functions with a continuous 
random variable we need to calculate the area under the function across the 
values of X we are interested in. For the normal distribution this seems a 
difficult task given the complexity of the formula. There is, however, a 
simple way to get what we want. Here again is the formula for the normal 
distribution: 

Equation: 

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}}$$


Looking at the formula for the normal distribution it is not clear just how 
we are going to solve for the probability doing it the same way we did it 
with the previous probability functions. There we put the data into the 
formula and did the math. 


To solve this puzzle we start knowing that the area under a probability 
density function is the probability. 


[Figure: normal curve with the area between x₁ and x₂ shaded, representing P(x₁ ≤ x ≤ x₂).] 


This shows that the area between x₁ and x₂ is the probability as stated in 
the formula: P(x₁ ≤ x ≤ x₂) 


The mathematical tool needed to find the area under a curve is integral 
calculus. The integral of the normal probability density function between 
the two points x₁ and x₂ is the area under the curve between these two 
points and is the probability between these two points. 


Doing these integrals is no fun and can be very time consuming. But now, 
remembering that there are an infinite number of normal distributions out 
there, we can consider the one with a mean of zero and a standard deviation 
of 1. This particular normal distribution is given the name Standard Normal 
Distribution. Putting these values into the formula reduces it to a very 
simple equation. We can now quite easily calculate all probabilities for any 
value of x for this particular normal distribution, which has a mean of zero 
and a standard deviation of 1. These probabilities have been tabulated and 
are available in the appendix to the text and widely on the web. They are 
presented in various ways. The table in this text is the most common 
presentation and is set up with probabilities for one-half the distribution 
beginning with zero, the mean, and moving outward. The shaded area in the 
graph at the top of the table in Statistical Tables represents the probability 
from zero to the specific Z value noted on the horizontal axis, Z. 


The only problem is that even with this table, it would be a ridiculous 
coincidence that our data had a mean of zero and a standard deviation of 
one. The solution is to convert the distribution we have with its mean and 
standard deviation to this new Standard Normal Distribution. The Standard 
Normal has a random variable called Z. 


Using the standard normal table, typically called the normal table, to find 
the probability of one standard deviation, go to the Z column, read down 
to 1.0, and then read across to column 0. That number, 0.3413, is the 
probability from zero to 1 standard deviation. At the top of the table is the 
shaded area in the distribution, which is the probability for one standard 
deviation. The table has solved our integral calculus problem, but only if 
our data has a mean of zero and a standard deviation of 1. 
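The table entries described here can be reproduced in software. The following sketch assumes Python 3.8 or later and uses `statistics.NormalDist` from the standard library; the helper name `table_area` is ours and simply mimics how this text's table is laid out (area from zero out to Z).

```python
from statistics import NormalDist

Z = NormalDist(0, 1)   # the standard normal distribution

def table_area(z):
    """Area under the standard normal curve between 0 and |z|."""
    return Z.cdf(abs(z)) - 0.5

print(round(table_area(1.0), 4))   # 0.3413, the entry for Z = 1.0
print(round(table_area(0.4), 4))   # 0.1554, the entry for Z = 0.4
```

Taking the absolute value of z reflects the symmetry of the curve: the area from zero to −1 is the same as the area from zero to +1, so one half of the table is enough.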


However, the essential point here is, the probability for one standard 
deviation on one normal distribution is the same on every normal 
distribution. If the population data set has a mean of 10 and a standard 
deviation of 5 then the probability from 10 to 15, one standard deviation, is 
the same as from zero to 1, one standard deviation on the standard normal 
distribution. To compute probabilities, areas, for any normal distribution, 
we need only to convert the particular normal distribution to the standard 
normal distribution and look up the answer in the tables. As review, here 
again is the standardizing formula: 

Equation: 

$$Z = \frac{X - \mu}{\sigma}$$


where Z is the value on the standard normal distribution, X is the value 
from a normal distribution one wishes to convert to the standard normal, 
and μ and σ are, respectively, the mean and standard deviation of that 
population. Note that the equation uses μ and σ, which denote population 
parameters. 
This is still dealing with probability so we always are dealing with the 
population, with known parameter values and a known distribution. It is 
also important to note that because the normal distribution is symmetrical it 
does not matter if the z-score is positive or negative when calculating a 
probability. One standard deviation to the left (negative Z-score) covers the 
same area as one standard deviation to the right (positive Z-score). This fact 
is why the Standard Normal tables do not provide areas for the left side of 
the distribution. Because of this symmetry, the Z-score formula is 
sometimes written as: 

Equation: 

$$Z = \frac{|X - \mu|}{\sigma}$$


where the vertical lines in the equation mean the absolute value of the 
number. 


What the standardizing formula is really doing is computing the number of 
standard deviations X is from the mean of its own distribution. The 
standardizing formula and the concept of counting standard deviations from 
the mean is the secret of all that we will do in this statistics class. The 
reason this is true is that all of statistics boils down to variation, and the 
counting of standard deviations is a measure of variation. 


This formula, in many disguises, will reappear over and over throughout 
this course. 
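To see the standardizing idea at work, the sketch below checks the claim made earlier in this section: for a population with mean 10 and standard deviation 5, the probability from 10 to 15 equals the probability from zero to 1 on the standard normal distribution. It assumes Python 3.8 or later; the function name `standardize` is our own.

```python
from statistics import NormalDist

def standardize(x, mu, sigma):
    """Convert a value x from N(mu, sigma) to its z-score."""
    return (x - mu) / sigma

population = NormalDist(10, 5)   # mean 10, standard deviation 5, as in the text
standard = NormalDist(0, 1)      # the standard normal distribution

# Probability between the mean and one standard deviation above it,
# computed directly on the original scale...
p_original = population.cdf(15) - population.cdf(10)

# ...and again after converting both endpoints to z-scores.
z_low, z_high = standardize(10, 10, 5), standardize(15, 10, 5)
p_standard = standard.cdf(z_high) - standard.cdf(z_low)

print(round(p_original, 4), round(p_standard, 4))
```

Both calculations give the same area, which is why a single table for the standard normal distribution serves every normal distribution.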


Example: 
The final exam scores in a statistics class were normally distributed with a 
mean of 63 and a standard deviation of five. 


Exercise: 


Problem: 


a. Find the probability that a randomly selected student scored more 
than 65 on the exam. 

b. Find the probability that a randomly selected student scored less 
than 85. 


Solution: 


a. Let X = a score on the final exam. X ~ N(63, 5), where μ = 63 and σ 
= 5. 


Draw a graph. 


Then, find P(x > 65). 


P(x > 65) = 0.3446 


Equation: 

$$P(X \ge x_1) = P(Z \ge Z_1) = 0.3446$$


The probability that any student selected at random scores more than 
65 is 0.3446. Here is how we found this answer. 


The normal table provides probabilities from zero to the value Z₁. For 
this problem the question can be written as: P(X ≥ 65) = P(Z ≥ Z₁), 
which is the area in the tail. To find this area the formula would be 0.5 
− P(63 ≤ X ≤ 65). One half of the probability is above the mean value 
because this is a symmetrical distribution. The graph shows how to 
find the area in the tail by subtracting from 0.5 the portion from the 
mean, zero, to the Z₁ value. The final answer is: P(X ≥ 65) = P(Z ≥ 0.4) = 
0.3446 


Z₁ = (x₁ − μ)/σ = (65 − 63)/5 = 0.4 
Area from the mean of zero to Z₁ is 0.1554 


P(x > 65) = P(z > 0.4) = 0.5 − 0.1554 = 0.3446 
Exercise: 


Problem: 
Solution: 


b. 


Z = (x − μ)/σ = (85 − 63)/5 = 4.4, which is larger than the maximum value 
on the Standard Normal Table. A score of 85 is 4.4 standard deviations 
above the mean of 63, which is beyond the range of the table. Therefore, 
the probability that one student scores less than 85 is approximately one 
(or 100%). 
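Both parts of this example can be checked without a table. The sketch below assumes Python 3.8 or later and uses `statistics.NormalDist`; the variable names are ours.

```python
from statistics import NormalDist

scores = NormalDist(63, 5)            # final exam scores: mean 63, sd 5

# Part a: area in the right tail beyond 65
p_more_than_65 = 1 - scores.cdf(65)

# Part b: area to the left of 85, which is 4.4 standard deviations above the mean
z_for_85 = (85 - 63) / 5
p_less_than_85 = scores.cdf(85)

print(round(p_more_than_65, 4))       # 0.3446
print(z_for_85, p_less_than_85)
```

The second probability is not exactly one, but it differs from one only far past the fourth decimal place, which is why the table-based answer rounds it to 100%.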


Note: 
Try It 
Exercise: 


Problem: 


The golf scores for a school team were normally distributed with a 
mean of 68 and a standard deviation of three. 


Find the probability that a randomly selected golfer scored less than 
65. 


Solution: 


normalcdf(0,65,68,3) = 0.1587 


Example: 

A personal computer is used for office work at home, research, 
communication, personal finances, education, entertainment, social 
networking, and a myriad of other things. Suppose that the average number 
of hours a household personal computer is used for entertainment is two 
hours per day. Assume the times for entertainment are normally distributed 
and the standard deviation for the times is half an hour. 


Exercise: 


Problem: 


a. Find the probability that a household personal computer is used for 
entertainment between 1.8 and 2.75 hours per day. 


Solution: 


a. Let X = the amount of time (in hours) a household personal 
computer is used for entertainment. X ~ N(2, 0.5) where μ = 2 and σ = 
0.5. 


Find P(1.8 < x < 2.75). 


The probability for which you are looking is the area between x = 1.8 
and x = 2.75. P(1.8 < x < 2.75) = 0.5886 


P(1.8 ≤ x ≤ 2.75) = P(Z₁ ≤ Z ≤ Z₂), where Z₁ = (1.8 − 2)/0.5 = −0.4 and 
Z₂ = (2.75 − 2)/0.5 = 1.5 


The probability that a household personal computer is used between 
1.8 and 2.75 hours per day for entertainment is 0.5886. 
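A probability between two values is the difference of two cumulative areas. The following sketch, which assumes Python 3.8 or later, reproduces the answer for this example; the variable names are our own.

```python
from statistics import NormalDist

hours = NormalDist(2, 0.5)    # entertainment use: mean 2 hours, sd 0.5 hours

# z-scores of the two endpoints
z_low = (1.8 - 2) / 0.5       # about -0.4
z_high = (2.75 - 2) / 0.5     # 1.5

# Area between the endpoints = area left of 2.75 minus area left of 1.8
p_between = hours.cdf(2.75) - hours.cdf(1.8)

print(round(z_low, 2), z_high, round(p_between, 4))
```

With a printed table the same answer comes from adding the area from zero to 0.4 and the area from zero to 1.5, because the two endpoints sit on opposite sides of the mean.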


Exercise: 


Problem: 


b. Find the maximum number of hours per day that the bottom 
quartile of households uses a personal computer for entertainment. 


Solution: 


b. To find the maximum number of hours per day that the bottom 
quartile of households uses a personal computer for entertainment, 
find the 25th percentile, k, where P(x < k) = 0.25. 


[Figure: normal curve with k = 1.66 marked; the shaded area to the left of k represents P(x < k) = 0.25 and the unshaded area represents P(x > k) = 0.75.] 


f(Z) = 0.5 − 0.25 = 0.25, therefore Z ≈ −0.675 (or just 0.67 using 
the table). Z = (x − μ)/σ = (x − 2)/0.5 = −0.675, therefore x 
= −0.675 × 0.5 + 2 = 1.66 hours. 


The maximum number of hours per day that the bottom quartile of 
households uses a personal computer for entertainment is 1.66 hours. 
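Finding a percentile runs the table lookup in reverse: we start from an area and recover the value. The sketch below assumes Python 3.8 or later, where `NormalDist.inv_cdf` performs this inverse lookup; the variable names are ours.

```python
from statistics import NormalDist

hours = NormalDist(2, 0.5)        # entertainment use: mean 2 hours, sd 0.5 hours

# 25th percentile: the value k with P(x < k) = 0.25
k = hours.inv_cdf(0.25)

# The matching z-score, and the same k rebuilt from k = mu + z * sigma
z = NormalDist(0, 1).inv_cdf(0.25)
k_from_z = 2 + z * 0.5

print(round(z, 3), round(k, 2))
```

The z-score is negative because the bottom quartile lies to the left of the mean.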


Note: 
Try It 
Exercise: 


Problem: 
The golf scores for a school team were normally distributed with a 
mean of 68 and a standard deviation of three. Find the probability that 


a golfer scored between 66 and 70. 


Solution: 


normalcdf(66,70,68,3) = 0.4950 


Example: 

In the United States the ages 13 to 55+ of smartphone users approximately 
follow a normal distribution with approximate mean and standard 
deviation of 36.9 years and 13.9 years, respectively. 


Exercise: 


Problem: 


a. Determine the probability that a random smartphone user in the age 
range 13 to 55+ is between 23 and 64.7 years old. 


Solution: 

a. 0.8186 
Exercise: 

Problem: 


b. Determine the probability that a randomly selected smartphone user 
in the age range 13 to 55+ is at most 50.8 years old. 


Solution: 


b. 0.8413 
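The two answers above are stated without working, so here is one way to arrive at them. This is a sketch assuming Python 3.8 or later; the variable names are our own.

```python
from statistics import NormalDist

ages = NormalDist(36.9, 13.9)     # smartphone users: mean 36.9 years, sd 13.9 years

# Part a: between 23 and 64.7 years old
z_23 = (23 - 36.9) / 13.9         # about -1
z_64_7 = (64.7 - 36.9) / 13.9     # about 2
p_between = ages.cdf(64.7) - ages.cdf(23)

# Part b: at most 50.8 years old
z_50_8 = (50.8 - 36.9) / 13.9     # about 1
p_at_most = ages.cdf(50.8)

print(round(p_between, 4), round(p_at_most, 4))
```

The endpoints were chosen to fall one standard deviation below and two above the mean in part a, and one above the mean in part b, so both answers can also be read from the normal table.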


Example: 
A citrus farmer who grows mandarin oranges finds that the diameters of 
mandarin oranges harvested on his farm follow a normal distribution with 
a mean diameter of 5.85 cm and a standard deviation of 0.24 cm. 
Exercise: 

Problem: 


a. Find the probability that a randomly selected mandarin orange from 
this farm has a diameter larger than 6.0 cm. Sketch the graph. 


Solution: 


Equation: 

$$z = \frac{6 - 5.85}{0.24} = 0.625$$


P(x > 6) = P(z > 0.625) = 0.2660 


b. The middle 20% of mandarin oranges from this farm have 
diameters between ______ and ______. 


f(Z) = 0.20/2 = 0.10, therefore Z ≈ ±0.25 
Z = (x − μ)/σ = (x − 5.85)/0.24 = ±0.25, so x = ±0.25 × 0.24 + 5.85 = (5.79, 5.91) 
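Both parts of the mandarin orange example can be sketched in Python as follows, assuming Python 3.8 or later. The variable names are ours. Values computed this way may differ in the last digit or two from answers read off a printed table, because the table only lists z-scores to two decimal places.

```python
from statistics import NormalDist

diameters = NormalDist(5.85, 0.24)   # mean 5.85 cm, sd 0.24 cm

# Part a: probability that a diameter is larger than 6.0 cm
z = (6.0 - 5.85) / 0.24
p_larger = 1 - diameters.cdf(6.0)

# Part b: the middle 20% runs from the 40th to the 60th percentile
low = diameters.inv_cdf(0.40)
high = diameters.inv_cdf(0.60)

print(round(z, 3), round(p_larger, 4))
print(round(low, 2), round(high, 2))
```

The middle 20% leaves 40% of the area in each tail, which is why the 40th and 60th percentiles mark its boundaries.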
Introduction 


[Figure: If you want to figure out the distribution of the change people carry in their pockets, using the Central Limit Theorem and assuming your sample is large enough, you will find that the distribution is the normal probability density function. (credit: John Lodder)] 


Why are we so concerned with means? Two reasons are: they give us a 
middle ground for comparison, and they are easy to calculate. In this 
chapter, you will study means and the Central Limit Theorem. 


The Central Limit Theorem is one of the most powerful and useful ideas 
in all of statistics. The Central Limit Theorem is a theorem, which means 
that it is NOT a theory or just somebody's idea of the way things work. As a 
theorem it ranks with the Pythagorean Theorem, or the theorem that tells us 
that the angles of a triangle must add to 180 degrees. These are facts of 
the ways of the world rigorously demonstrated with mathematical precision 
and logic. As we will see, this powerful theorem will determine just what we 
can, and cannot, say in inferential statistics. The Central Limit Theorem is 
concerned with drawing finite samples of size n from a population with a 
known mean, μ, and a known standard deviation, σ. The conclusion is that if 
we collect samples of size n with a "large enough n," calculate each 
sample's mean, and create a histogram (distribution) of those means, then 
the resulting distribution will tend to have an approximate normal 
distribution. 


The astounding result is that it does not matter what the distribution of 
the original population is, or whether you even need to know it. The 
important fact is that the distribution of sample means tends to follow 
the normal distribution. 


The size of the sample, n, that is required in order to be "large enough" 
depends on the original population from which the samples are drawn (the 
sample size should be at least 30 or the data should come from a normal 
distribution). If the original population is far from normal, then more 
observations are needed for the sample means. Sampling is done 
randomly and with replacement in the theoretical model. 
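A small simulation can make this idea concrete. The sketch below is only an illustration under assumptions of our own: the population is exponential with mean 1 and standard deviation 1 (strongly skewed, so far from normal), each sample has 40 observations, and 2,000 samples are drawn. None of these numbers come from the text.

```python
import random
from math import sqrt
from statistics import fmean, stdev

rng = random.Random(0)        # fixed seed so the run is repeatable
SAMPLE_SIZE = 40              # assumed "large enough" n for this illustration
NUM_SAMPLES = 2000            # how many samples we draw

def draw_sample_mean():
    """Draw one sample from the skewed population and return its mean."""
    return fmean(rng.expovariate(1.0) for _ in range(SAMPLE_SIZE))

sample_means = [draw_sample_mean() for _ in range(NUM_SAMPLES)]

center = fmean(sample_means)                  # should be close to the population mean, 1
spread = stdev(sample_means)                  # should be close to 1 / sqrt(40)
expected_spread = 1.0 / sqrt(SAMPLE_SIZE)

# Share of sample means within one standard deviation of their center;
# for a normal distribution this share is about 68%.
within_one = sum(abs(m - center) <= spread for m in sample_means) / NUM_SAMPLES

print(round(center, 3), round(spread, 3), round(expected_spread, 3), round(within_one, 3))
```

Even though individual observations from this population are heavily skewed, the sample means cluster symmetrically around the population mean, and roughly the share predicted by the Empirical Rule falls within one standard deviation.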


Glossary 


Sampling Distribution 
Given simple random samples of size n from a given population with a 
measured characteristic such as mean, proportion, or standard 
deviation for each sample, the probability distribution of all the 
measured characteristics is called a sampling distribution. 


The Central Limit Theorem for Sample Means 


The sampling distribution is a theoretical distribution. It is created by taking 
many many samples of size n from a population. Each sample mean is then 
treated like a single observation of this new distribution, the sampling 
distribution. The genius of thinking this way is that it recognizes that when 
we sample we are creating an observation and that observation must come 
from some particular distribution. The Central Limit Theorem answers the 
question: from what distribution did a sample mean come? If this is 
discovered, then we can treat a sample mean just like any other observation 
and calculate probabilities about what values it might take on. We have 
effectively moved from the world of statistics where we know only what we 
have from the sample, to the world of probability where we know the 
distribution from which the sample mean came and the parameters of that 
distribution. 


The reasons that one samples a population are obvious. The time and 
expense of checking every invoice to determine its validity or every 
shipment to see if it contains all the items may well exceed the cost of 
errors in billing or shipping. For some products, sampling would require 
destroying them, called destructive sampling. One such example is 
measuring the ability of a metal to withstand saltwater corrosion for parts 
on ocean going vessels. 


Sampling thus raises an important question: just which sample was drawn? 
Even if the sample were randomly drawn, there are theoretically an almost 
infinite number of samples. With just 100 items, there are more than 75 
million unique samples of size five that can be drawn. If six are in the 
sample, the number of possible samples increases to just more than one 
billion. Of the 75 million possible samples, then, which one did you get? If 
there is variation in the items to be sampled, there will be variation in the 
samples. One could draw an "unlucky" sample and make very wrong 
conclusions concerning the population. This recognition that any sample we 
draw is really only one from a distribution of samples provides us with what 
is probably the single most important theorem in statistics: the Central 
Limit Theorem. Without the Central Limit Theorem it would be 
impossible to proceed to inferential statistics from simple probability 
theory. In its most basic form, the Central Limit Theorem states that 
regardless of the underlying probability density function of the population 
data, the theoretical distribution of the means of samples from the 
population will be normally distributed. In essence, this says that the mean 
of a sample should be treated like an observation drawn from a normal 
distribution. The Central Limit Theorem only holds if the sample size is 
"large enough," which has been shown to be only 30 observations or more. 


[link] graphically displays this very important proposition. 
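The sample counts quoted in the discussion above (more than 75 million samples of size five, and just over one billion of size six, from 100 items) are combinations, and they can be verified directly. The sketch assumes Python 3.8 or later, where `math.comb` is available; the variable names are ours.

```python
from math import comb

ITEMS = 100

# Number of distinct samples (order ignored, no item repeated)
samples_of_five = comb(ITEMS, 5)
samples_of_six = comb(ITEMS, 6)

print(f"{samples_of_five:,}")   # 75,287,520
print(f"{samples_of_six:,}")    # 1,192,052,400
```

Any one sample we actually draw is a single outcome among all of these possibilities, which is the motivation for studying the distribution of sample means.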


[Figure: two panels, the Population Distribution (top) and the Sampling Distribution (bottom).] 


Notice that the horizontal axis in the top panel is labeled X. These are the 
individual observations of the population. This is the unknown distribution 
of the population values. The graph is purposefully drawn all squiggly to 
show that it does not matter just how odd ball it really is. Remember, we 
will never know what this distribution looks like, or its mean or standard 
deviation for that matter. 


The horizontal axis in the bottom panel is labeled X̄'s. This is the 
theoretical distribution called the sampling distribution of the means. Each 
observation on this distribution is a sample mean. All these sample means 
were calculated from individual samples with the same sample size. The 
theoretical sampling distribution contains all of the sample mean values 
from all the possible samples that could have been taken from the 
population. Of course, no one would ever actually take all of these samples, 
but if they did this is how they would look. And the Central Limit Theorem 
says that they will be normally distributed. 


The Central Limit Theorem goes even further and tells us the mean and 
standard deviation of this theoretical distribution. 


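Parameter            Population     Sample           Sampling distribution 
                     distribution                    of \(\overline{X}\)'s 
Mean                 \(\mu\)        \(\overline{x}\)   \(\mu_{\overline{x}}\) and \(E(\mu_{\overline{x}}) = \mu\) 
Standard deviation   \(\sigma\)     \(s\)            \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}}\) 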


The practical significance of the Central Limit Theorem is that now we can 
compute probabilities for drawing a sample mean, \(\overline{X}\), in just the 
same way as we did for drawing specific observations, X's, when we knew the 
population mean and standard deviation and that the population data were 
normally distributed. The standardizing formula has to be amended to 
recognize that the mean and standard deviation of the sampling distribution 
(the latter is sometimes called the standard error of the mean) are 
different from those of the population distribution, but otherwise nothing 
has changed. The new standardizing formula is 

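Equation: 

\[ Z = \frac{\overline{X} - \mu_{\overline{x}}}{\sigma_{\overline{x}}} = \frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \] 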


Notice that \(\mu_{\overline{x}}\) in the first formula has been changed to 
simply \(\mu\) in the second version. The reason is that mathematically it 
can be shown that the expected value of \(\mu_{\overline{x}}\) is equal to 
\(\mu\). This was stated in [link] above. Mathematically, the \(E(x)\) symbol 
is read as the “expected value of x”. This formula will be used in the next 
unit to provide estimates of the unknown population parameter \(\mu\). 
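
The standardizing step for a sample mean can be sketched in a few lines of 
Python. This is only an illustration: the function names and the values 
(population mean 50, population standard deviation 10, sample size 25) are 
assumptions chosen for the example and do not come from the text. 

```python
from math import sqrt
from statistics import NormalDist


def sample_mean_z(x_bar, mu, sigma, n):
    """z-score of a sample mean: (x_bar - mu) / (sigma / sqrt(n))."""
    standard_error = sigma / sqrt(n)
    return (x_bar - mu) / standard_error


def prob_sample_mean_below(x_bar, mu, sigma, n):
    """P(sample mean < x_bar), using the normal approximation from the CLT."""
    return NormalDist().cdf(sample_mean_z(x_bar, mu, sigma, n))


# Hypothetical values: mu = 50, sigma = 10, n = 25, observed sample mean 52.
z = sample_mean_z(52, 50, 10, 25)          # standard error is 10 / 5 = 2
p = prob_sample_mean_below(52, 50, 10, 25)
print(z, round(p, 4))
```

The only change from standardizing a single observation is the denominator: 
the population standard deviation is replaced by the standard error of the 
mean. 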
Chapter Review 


In a population whose distribution may be known or unknown, if the size 
(n) of samples is sufficiently large, the distribution of the sample means will 
be approximately normal. The mean of the sample means will equal the 
population mean. The standard deviation of the distribution of the sample 
means, called the standard error of the mean, is equal to the population 
standard deviation divided by the square root of the sample size (n). 


Formula Review 


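The Central Limit Theorem for Sample Means: 
\(\overline{X} \sim N\left(\mu_{\overline{x}}, \frac{\sigma}{\sqrt{n}}\right)\) 


The Mean \(\overline{X}\): \(\mu_{\overline{x}}\) 


Standard Error of the Mean (Standard Deviation (\(\overline{X}\))): 
\(\frac{\sigma}{\sqrt{n}}\) 


Central Limit Theorem for Sample Means z-score: 
\(z = \frac{\overline{x} - \mu_{\overline{x}}}{\sigma/\sqrt{n}}\) 


Finite Population Correction Factor for the sampling distribution of means: 
\(z = \frac{\overline{x} - \mu}{\frac{\sigma}{\sqrt{n}} \times \sqrt{\frac{N-n}{N-1}}}\) 


Finite Population Correction Factor for the sampling distribution of 
proportions: 
\(\sigma_{p'} = \sqrt{\frac{p(1-p)}{n}} \times \sqrt{\frac{N-n}{N-1}}\) 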
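
As a rough sketch of how the finite population correction factor changes the 
standard error, the Python below applies the factor sqrt((N - n)/(N - 1)) to 
both the mean and the proportion. The function names and the example numbers 
(a population of 1,000, a sample of 100, a standard deviation of 10) are 
assumptions made for illustration. 

```python
from math import sqrt


def fpc(population_size, sample_size):
    """Finite population correction factor sqrt((N - n) / (N - 1))."""
    return sqrt((population_size - sample_size) / (population_size - 1))


def se_mean_finite(sigma, sample_size, population_size):
    """Standard error of the mean with the correction applied."""
    return sigma / sqrt(sample_size) * fpc(population_size, sample_size)


def se_proportion_finite(p, sample_size, population_size):
    """Standard error of a proportion with the correction applied."""
    return sqrt(p * (1 - p) / sample_size) * fpc(population_size, sample_size)


# Hypothetical example: N = 1000, n = 100 (10% of the population), sigma = 10.
print(round(fpc(1000, 100), 4))
print(round(se_mean_finite(10, 100, 1000), 4))  # smaller than 10 / sqrt(100) = 1
```

Because the factor is always at most one, the corrected standard error is 
never larger than the uncorrected one, and the two agree closely when the 
sample is a small share of the population. 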


Homework 


Exercise: 
Problem: 
Previously, De Anza statistics students estimated that the amount of 
change daytime statistics students carry is exponentially distributed 
with a mean of $0.88. Suppose that we randomly pick 25 daytime 
statistics students. 


a. In words, X = _____________ 

b. X ~ _____(_____,_____) 

c. In words, \(\overline{X}\) = _____________ 

d. \(\overline{X}\) ~ _____(_____,_____) 


e. Find the probability that an individual had between $0.80 and 
$1.00. Graph the situation, and shade in the area to be determined. 


f. Find the probability that the average of the 25 students was 
between $0.80 and $1.00. Graph the situation, and shade in the 
area to be determined. 

g. Explain why there is a difference in part e and part f. 


Solution: 


a. X = amount of change students carry 

b. X ~ E(0.88, 0.88) 

c. \(\overline{X}\) = average amount of change carried by a sample of 25 
students. 


d. \(\overline{X}\) ~ N(0.88, 0.176) 
e. 0.0819 
f. 0.4276 


g. The distributions are different. Part e uses the exponential 
distribution of X, and part f uses the normal distribution of 
\(\overline{X}\). 


Exercise: 


Problem: 


Suppose that the distance of fly balls hit to the outfield (in baseball) is 
normally distributed with a mean of 250 feet and a standard deviation 
of 50 feet. We randomly sample 49 fly balls. 


a. If \(\overline{X}\) = average distance in feet for 49 fly balls, then 
\(\overline{X}\) ~ _______(_______,_______) 


b. What is the probability that the 49 balls traveled an average of 
less than 240 feet? Sketch the graph. Scale the horizontal axis for 
\(\overline{X}\). Shade the region corresponding to the probability. Find 
the probability. 

c. Find the 80th percentile of the distribution of the average of 49 fly 
balls. 


Exercise: 


Problem: 


According to the Internal Revenue Service, the average length of time 
for an individual to complete (keep records for, learn, prepare, copy, 
assemble, and send) IRS Form 1040 is 10.53 hours (without any 
attached schedules). The distribution is unknown. Let us assume that 
the standard deviation is two hours. Suppose we randomly sample 36 
taxpayers. 


a. In words, X = _____________ 

b. In words, \(\overline{X}\) = _____________ 

c. \(\overline{X}\) ~ _____(_____,_____) 

d. Would you be surprised if the 36 taxpayers finished their Form 
1040s in an average of more than 12 hours? Explain why or why 
not in complete sentences. 

e. Would you be surprised if one taxpayer finished his or her Form 
1040 in more than 12 hours? In a complete sentence, explain why. 
Solution: 


a. length of time for an individual to complete IRS form 1040, in 
hours. 

b. mean length of time for a sample of 36 taxpayers to complete IRS 
form 1040, in hours. 

c. \(N\left(10.53, \frac{1}{3}\right)\) 

d. Yes. I would be surprised, because the probability is almost 0. 

e. No. I would not be totally surprised because the probability is 
0.2312 


Exercise: 


Problem: 


Suppose that a category of world-class runners is known to run a 
marathon (26 miles) in an average of 145 minutes with a standard 
deviation of 14 minutes. Consider 49 of the races. Let \(\overline{X}\) be 
the average of the 49 races. 


a. \(\overline{X}\) ~ _____(_____,_____) 

b. Find the probability that the runner will average between 142 and 
146 minutes in these 49 marathons. 

c. Find the 80th percentile for the average of these 49 marathons. 

d. Find the median of the average running times. 
Exercise: 
Problem: 
The length of songs in a collector’s iTunes album collection is 
uniformly distributed from two to 3.5 minutes. Suppose we randomly 
pick five albums from the collection. There are a total of 43 songs on 
the five albums. 


a. In words, X = _____________ 

b. X ~ _____________ 

c. In words, \(\overline{X}\) = _____________ 

d. \(\overline{X}\) ~ _____(_____,_____) 

e. Find the first quartile for the average song length. 

f. The IQR (interquartile range) for the average song length is from 
_____ to _____. 


Solution: 


a. the length of a song, in minutes, in the collection 

b. U(2, 3.5) 

c. the average length, in minutes, of the songs from a sample of five 
albums from the collection 


d. N(2.75, 0.066) 
e. 2.71 minutes 
f. 0.09 minutes 


Exercise: 


Problem: 


In 1940 the average size of a U.S. farm was 174 acres. Let’s say that 
the standard deviation was 55 acres. Suppose we randomly survey 38 
farmers from 1940. 


a. In words, X = _____________ 

b. In words, \(\overline{X}\) = _____________ 

c. \(\overline{X}\) ~ _____(_____,_____) 

d. The IQR for \(\overline{X}\) is from _____ acres to _____ acres. 
Exercise: 


Problem: 


Determine which of the following are true and which are false. Then, 
in complete sentences, justify your answers. 


a. When the sample size is large, the mean of \(\overline{X}\) is 
approximately equal to the mean of X. 


b. When the sample size is large, \(\overline{X}\) is approximately normally 
distributed. 


c. When the sample size is large, the standard deviation of 
\(\overline{X}\) is approximately the same as the standard deviation of X. 
Solution: 


a. True. The mean of a sampling distribution of the means is 
approximately the mean of the data distribution. 


b. True. According to the Central Limit Theorem, the larger the 
sample, the closer the sampling distribution of the means 
becomes normal. 

c. False. The standard deviation of the sampling distribution of the means 
is the standard deviation of X divided by the square root of the sample 
size, so it decreases as the sample size increases and becomes smaller 
than the standard deviation of X. 


Exercise: 


Problem: 


The percent of fat calories that a person in America consumes each 
day is normally distributed with a mean of about 36 and a standard 
deviation of about ten. Suppose that 16 individuals are randomly 
chosen. Let \(\overline{X}\) = average percent of fat calories. 


a. \(\overline{X}\) ~ _____(_____,_____) 
b. For the group of 16, find the probability that the average percent 
of fat calories consumed is more than five. Graph the situation 
and shade in the area to be determined. 
c. Find the first quartile for the average percent of fat calories. 


Exercise: 


Problem: 


The distribution of income in some Third World countries is 
considered wedge shaped (many very poor people, very few middle 
income people, and even fewer wealthy people). Suppose we pick a 
country with a wedge shaped distribution. Let the average salary be 
$2,000 per year with a standard deviation of $8,000. We randomly 
survey 1,000 residents of that country. 


a. In words, X = _____________ 
b. In words, \(\overline{X}\) = _____________ 
c. \(\overline{X}\) ~ _____(_____,_____) 

d. How is it possible for the standard deviation to be greater than the 
average? 

e. Why is it more likely that the average of the 1,000 residents will 
be from $2,000 to $2,100 than from $2,100 to $2,200? 


Solution: 


a. X = the yearly income of someone in a third world country 
b. the average salary from samples of 1,000 residents of a third 
world country 

c. \(\overline{X} \sim N\left(2000, \frac{8000}{\sqrt{1000}}\right)\) 


d. Very wide differences in data values can have averages smaller 
than standard deviations. 

e. The distribution of the sample mean will have higher probabilities 
closer to the population mean. 


\(P(2000 < \overline{X} < 2100) = 0.1537\) 
\(P(2100 < \overline{X} < 2200) = 0.1317\) 


Exercise: 
Problem: 


Which of the following is NOT TRUE about the distribution for 
averages? 


a. The mean, median, and mode are equal. 
b. The area under the curve is one. 

c. The curve never touches the x-axis. 

d. The curve is skewed to the right. 


Exercise: 


Problem: 


The cost of unleaded gasoline in the Bay Area once followed an 
unknown distribution with a mean of $4.59 and a standard deviation of 
$0.10. Sixteen gas stations from the Bay Area are randomly chosen. 
We are interested in the average cost of gasoline for the 16 gas 
stations. The distribution to use for the average cost of gasoline for the 
16 gas stations is: 


a. \(\overline{X} \sim N(4.59, 0.10)\) 
b. \(\overline{X} \sim N\left(4.59, \frac{0.10}{\sqrt{16}}\right)\) 
c. \(\overline{X} \sim N\left(4.59, \frac{\sqrt{16}}{0.10}\right)\) 
Solution: 
b 
Glossary 
Average 


a number that describes the central tendency of the data; there are a 
number of specialized averages, including the arithmetic mean, 
weighted mean, median, mode, and geometric mean. 


Central Limit Theorem 
Given a random variable (RV) with known mean \(\mu\) and known standard 
deviation, \(\sigma\), we are sampling with size n, and we are interested 
in a new RV: the sample mean, \(\overline{X}\). If the size (n) of the 
sample is sufficiently large, then 
\(\overline{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)\). If the size 
(n) of the sample is sufficiently large, then the distribution of the 
sample means will approximate a normal distribution regardless of the 
shape of the population. The mean of the sample means will equal the 
population mean. The standard deviation of the distribution of the sample 
means, \(\frac{\sigma}{\sqrt{n}}\), is called the standard error of the 
mean. 


Standard Error of the Mean 


the standard deviation of the distribution of the sample means, or 
\(\frac{\sigma}{\sqrt{n}}\) 


Using the Central Limit Theorem 


Examples of the Central Limit Theorem 


Law of Large Numbers 


The law of large numbers says that if you take samples of larger and larger 
size from any population, then the mean of the sampling distribution, 
\(\mu_{\overline{x}}\), tends to get closer and closer to the true 
population mean, \(\mu\). From the Central Limit Theorem, we know that as n 
gets larger and larger, the sample means follow a normal distribution. The 
larger n gets, the smaller the standard deviation of the sampling 
distribution gets. (Remember that the standard deviation for the sampling 
distribution of \(\overline{X}\) is \(\frac{\sigma}{\sqrt{n}}\).) This means 
that the sample mean \(\overline{x}\) must be closer to the population mean 
\(\mu\) as n increases. We can say that \(\mu\) is the value that the sample 
means approach as n gets larger. The Central Limit Theorem illustrates the 
law of large numbers. 


This concept is so important and plays such a critical role in what follows 
that it deserves to be developed further. Indeed, there are two critical 
issues that flow from the Central Limit Theorem and the application of the 
law of large numbers to it. These are 


1. The probability density function of the sampling distribution of means 
is normally distributed regardless of the underlying distribution of the 
population observations and 

2. the standard deviation of the sampling distribution decreases as the size of 
the samples that were used to calculate the means for the sampling 
distribution increases. 


Taking these in order, it would seem counterintuitive that the population 
may have any distribution and the distribution of means coming from it 
would be normally distributed. With the use of computers, experiments can 
be simulated that show the process by which the sampling distribution 
changes as the sample size is increased. These simulations show visually 
the results of the mathematical proof of the Central Limit Theorem. 
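
A simulation of this kind can be sketched with Python's standard library. 
The setup below is an assumption made for illustration, not the procedure 
used to produce the figures: it draws 1,000 samples from a skewed 
(exponential) population with mean 10, for sample sizes 10, 25 and 50, and 
summarizes the resulting sample means. The function name, the seed and the 
choice of population are all hypothetical. 

```python
import random
from math import sqrt
from statistics import mean, stdev


def simulate_sample_means(n, num_samples=1000, pop_mean=10.0, seed=0):
    """Draw num_samples samples of size n from an exponential population
    and return the list of their sample means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        sum(rng.expovariate(1 / pop_mean) for _ in range(n)) / n
        for _ in range(num_samples)
    ]


for n in (10, 25, 50):
    means = simulate_sample_means(n)
    # For an exponential population the standard deviation equals the mean.
    predicted_se = 10.0 / sqrt(n)
    # Share of sample means within one predicted standard error of 10;
    # a normal distribution would put roughly 68% there.
    share = sum(abs(m - 10.0) <= predicted_se for m in means) / len(means)
    print(n, round(mean(means), 2), round(stdev(means), 2),
          round(predicted_se, 2), round(share, 2))
```

In each row, the mean of the sample means should sit near 10 and their 
standard deviation near the predicted value, which shrinks as n grows. 
Plotting a histogram of `means` for each n would give pictures comparable 
to the panels described here. 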


Here are three examples of very different population distributions and the 
evolution of the sampling distribution to a normal distribution as the sample 
size increases. The top panel in these cases represents the histogram for the 
original data. The three panels show the histograms for 1,000 randomly 
drawn samples for different sample sizes: n=10, n= 25 and n=50. As the 
sample size increases, and the number of samples taken remains constant, 
the distribution of the 1,000 sample means becomes closer to the smooth 
line that represents the normal distribution. 


[link] is for a normal distribution of individual observations and we would 
expect the sampling distribution to converge on the normal quickly. The 
results show this and show that even at a very small sample size the 
distribution is close to the normal distribution. 
[link] is a uniform distribution which, a bit amazingly, quickly approached 
the normal distribution even with only a sample of 10. 
[link] is a skewed distribution. This last one could be an exponential, 
geometric, or binomial with a small probability of success creating the skew 
in the distribution. For skewed distributions our intuition would say that this 
will take larger sample sizes to move to a normal distribution and indeed 
that is what we observe from the simulation. Nevertheless, at a sample size 
of 50, not considered a very large sample, the distribution of sample means 
has very decidedly gained the shape of the normal distribution. 
The Central Limit Theorem provides more than the proof that the sampling 
distribution of means is normally distributed. It also provides us with the 
mean and standard deviation of this distribution. Further, as discussed 
above, the expected value of the mean, \(\mu_{\overline{x}}\), is equal to 
the mean of the population of the original data which is what we are 
interested in estimating from the sample we took. We have already inserted 
this conclusion of the Central Limit Theorem into the formula we use for 
standardizing from the sampling distribution to the standard normal 
distribution. And finally, the Central Limit Theorem has also provided the 
standard deviation of the sampling distribution, 
\(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}}\), and this is critical to 
have to calculate probabilities of values of the new random variable, 
\(\overline{x}\). 


[link] shows a sampling distribution. The mean has been marked on the 
horizontal axis of the \(\overline{x}\)'s and the standard deviation has 
been written to the right above the distribution. Notice that the standard 
deviation of the sampling distribution is the original standard deviation 
of the population, divided by the square root of the sample size. We have 
already seen that as the sample size increases the sampling distribution 
becomes closer and closer to the normal distribution. As this happens, the 
standard deviation of the sampling distribution changes in another way; the 
standard deviation decreases as n increases. At very large n, the standard 
deviation of the sampling distribution becomes very small and at infinity 
it collapses on top of the population mean. This is what it means that the 
expected value of \(\mu_{\overline{x}}\) is the population mean, \(\mu\). 


\[ E(\mu_{\overline{x}}) = \mu \] 


At non-extreme values of n, this relationship between the standard deviation 
of the sampling distribution and the sample size plays a very important part 
in our ability to estimate the parameters we are interested in. 


[link] shows three sampling distributions. The only change that was made is 
the sample size that was used to get the sample means for each distribution. 
As the sample size increases, with n going from 10 to 30 to 50, the standard 
deviations of the respective sampling distributions decrease because the 
square root of the sample size is in the denominator of the standard 
deviations of the sampling distributions. 
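
The effect of the sample size on the standard deviation of the sampling 
distribution is easy to tabulate. The sketch below assumes a population 
standard deviation of 10, which is a made-up value for illustration; only 
the sample sizes 10, 30 and 50 come from the discussion. 

```python
from math import sqrt


def standard_error(sigma, n):
    """Standard deviation of the sampling distribution of the mean."""
    return sigma / sqrt(n)


sigma = 10.0  # assumed population standard deviation, for illustration only

for n in (10, 30, 50):
    print(f"n = {n:2d}: standard error = {standard_error(sigma, n):.3f}")
```

Because n enters through its square root, multiplying the sample size by 
five (from 10 to 50) cuts the standard error by a factor of about 2.24, not 
by five. 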


The implications for this are very important. [link] shows the effect of the 
sample size on the confidence we will have in our estimates. These are two 
sampling distributions from the same population. One sampling distribution 
was created with samples of size 10 and the other with samples of size 50. 
All other things constant, the sampling distribution with sample size 50 has 
a smaller standard deviation that causes the graph to be higher and 
narrower. The important effect of this is that for the same probability of one 
standard deviation from the mean, this distribution covers much less of a 
range of possible values than the other distribution. One standard deviation 
is marked on the \(\overline{X}\) axis for each distribution. This is shown 
by the two arrows that are plus or minus one standard deviation for each 
distribution. If we consider the probability that the true mean is within 
one standard deviation of the sample mean, then for the sampling 
distribution with the smaller sample size, the possible range of values is 
much greater. A simple question is, would you rather have a sample mean 
from the narrow, tight distribution, or the flat, wide distribution as the 
estimate of the population mean? Your answer tells us why people 
intuitively will always choose data from a large sample rather than a small 
sample. The sample mean they are getting is coming from a more compact 
distribution. This concept will be the foundation for what will be called 
level of confidence in the next unit. 
Chapter Review 


The Central Limit Theorem can be used to illustrate the law of large 
numbers. The law of large numbers states that the larger the sample size 
you take from a population, the closer the sample mean \(\overline{x}\) gets 
to \(\mu\). 


Use the following information to answer the next ten exercises: A 
manufacturer produces 25-pound lifting weights. The lowest actual weight 
is 24 pounds, and the highest is 26 pounds. Each weight is equally likely so 
the distribution of weights is uniform. A sample of 100 weights is taken. 
Exercise: 


Problem: 


a. What is the distribution for the weight of one 25-pound lifting 
weight? What are the mean and standard deviation? 

b. What is the distribution for the mean weight of 100 25-pound 
lifting weights? 


c. Find the probability that the mean actual weight for the 100 
weights is less than 24.9. 


Solution: 


a. U(24, 26), 25, 0.5774 
b. N(25, 0.0577) 
c. 0.0416 
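
One way to check these three answers is sketched below in Python. It relies 
on two standard facts that the exercise does not spell out: a uniform 
distribution on [a, b] has mean (a + b)/2 and standard deviation 
(b - a)/sqrt(12). The variable names are illustrative only. 

```python
from math import sqrt
from statistics import NormalDist

low, high, n = 24, 26, 100          # values given in the exercise

mu = (low + high) / 2               # mean of U(24, 26)
sigma = (high - low) / sqrt(12)     # standard deviation of U(24, 26)

# Normal approximation for the mean of 100 weights (Central Limit Theorem).
x_bar_dist = NormalDist(mu, sigma / sqrt(n))
p_below = x_bar_dist.cdf(24.9)      # P(sample mean < 24.9)

print(round(mu, 4), round(sigma, 4), round(x_bar_dist.stdev, 4))
print(round(p_below, 4))
```

The printed values should agree with the solution above to four decimal 
places. 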


Exercise: 


Problem: Draw the graph from [link] 
Exercise: 


Problem: 


Find the probability that the mean actual weight for the 100 weights is 
greater than 25.2. 


Solution: 


0.0003 


Exercise: 


Problem: Draw the graph from [link] 
Exercise: 


Problem: 
Find the 90th percentile for the mean weight for the 100 weights. 
Solution: 


25.07 


Exercise: 


Problem: Draw the graph from [link] 
Exercise: 


Problem: 
a. What is the distribution for the sum of the weights of 100 25- 
pound lifting weights? 
b. Find \(P(\Sigma x < 2,450)\). 
Solution: 


a. N(2,500, 5.7735) 
b. 0 


Exercise: 


Problem: Draw the graph from [link] 
Exercise: 


Problem: 

Find the 90th percentile for the total weight of the 100 weights. 
Solution: 

2,507.40 


Exercise: 


Problem: Draw the graph from [link] 


Use the following information to answer the next five exercises: The length 
of time a particular smartphone's battery lasts follows an exponential 


distribution with a mean of ten months. A sample of 64 of these 
smartphones is taken. 
Exercise: 


Problem: 


a. What is the standard deviation? 
b. What is the parameter m? 


Solution: 


a. 10 
b. \(\frac{1}{10}\) 


Exercise: 
Problem: 


What is the distribution for the length of time one battery lasts? 
Exercise: 


Problem: 


What is the distribution for the mean length of time 64 batteries last? 


Solution: 
\(N\left(10, \frac{10}{8}\right)\) 
Exercise: 


Problem: 


What is the distribution for the total length of time 64 batteries last? 
Exercise: 


Problem: 


Find the probability that the sample mean is between seven and 11. 


Solution: 


0.7799 
Exercise: 


Problem: 
Find the 80th percentile for the total length of time 64 batteries last. 


Exercise: 


Problem: Find the IQR for the mean amount of time 64 batteries last. 


Solution: 


1.69 
Exercise: 


Problem: 


Find the middle 80% for the total amount of time 64 batteries last. 


Use the following information to answer the next eight exercises: A uniform 
distribution has a minimum of six and a maximum of ten. A sample of 50 is 
taken. 

Exercise: 


Problem: Find \(P(\Sigma x > 420)\). 


Solution: 


0.0072 


Exercise: 


Problem: Find the 90th percentile for the sums. 


Exercise: 


Problem: Find the 15th percentile for the sums. 


Solution: 


391.54 


Exercise: 


Problem: Find the first quartile for the sums. 


Exercise: 


Problem: Find the third quartile for the sums. 


Solution: 


405.51 


Exercise: 


Problem: Find the 80th percentile for the sums. 
Exercise: 
Problem: 
A population has a mean of 25 and a standard deviation of 2. If it is 
sampled repeatedly with samples of size 49, what are the mean and 
standard deviation of the sample means? 


Solution: 


Mean = 25, standard deviation = 2/7 


Exercise: 


Problem: 


A population has a mean of 48 and a standard deviation of 5. If it is 
sampled repeatedly with samples of size 36, what is the mean and 
standard deviation of the sample means? 


Solution: 


Mean = 48, standard deviation = 5/6 
Exercise: 
Problem: 
A population has a mean of 90 and a standard deviation of 6. If it is 
sampled repeatedly with samples of size 64, what are the mean and 
standard deviation of the sample means? 


Solution: 


Mean = 90, standard deviation = 3/4 
Exercise: 
Problem: 
A population has a mean of 120 and a standard deviation of 2.4. If it is 
sampled repeatedly with samples of size 40, what are the mean and 
standard deviation of the sample means? 


Solution: 


Mean = 120, standard deviation = 0.38 
Exercise: 
Problem: 
A population has a mean of 17 and a standard deviation of 1.2. If it is 
sampled repeatedly with samples of size 50, what are the mean and 
standard deviation of the sample means? 


Solution: 


Mean = 17, standard deviation = 0.17 
Exercise: 
Problem: 
A population has a mean of 17 and a standard deviation of 0.2. If it is 
sampled repeatedly with samples of size 16, what are the expected value 
and standard deviation of the sample means? 


Solution: 


Expected value = 17, standard deviation = 0.05 
Exercise: 
Problem: 
A population has a mean of 38 and a standard deviation of 3. If it is 
sampled repeatedly with samples of size 48, what are the expected value 
and standard deviation of the sample means? 


Solution: 


Expected value = 38, standard deviation = 0.43 

Exercise: 
Problem: 
A population has a mean of 14 and a standard deviation of 5. If it is 
sampled repeatedly with samples of size 60, what is the expected value 
and standard deviation of the sample means? 


Solution: 


Expected value = 14, standard deviation = 0.65 


Homework 


Exercise: 


Problem: 


A large population of 5,000 students takes a practice test to prepare for 
a standardized test. The population mean is 140 questions correct, and 
the standard deviation is 80. What size samples should a researcher 
take to get a distribution of means of the samples with a standard 
deviation of 10? 


Solution: 


64 
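
The reasoning behind this answer can be written as a small Python helper. 
Setting the target standard deviation equal to sigma/sqrt(n) and solving 
for n gives n = (sigma / target)^2. The function name is hypothetical, and 
the sketch ignores the finite population correction. 

```python
from math import ceil


def required_sample_size(sigma, target_se):
    """Smallest n with sigma / sqrt(n) <= target_se, ignoring any
    finite population correction."""
    # Rounding up keeps the standard error at or below the target;
    # floating-point error can push an exact answer up by one in rare cases.
    return ceil((sigma / target_se) ** 2)


print(required_sample_size(80, 10))
```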
Exercise: 


Problem: 


A large population has skewed data with a mean of 70 and a standard 
deviation of 6. Samples of size 100 are taken, and the distribution of 
the means of these samples is analyzed. 


a. Will the distribution of the means be closer to a normal 
distribution than the distribution of the population? 

b. Will the mean of the means of the samples remain close to 70? 

c. Will the distribution of the means have a smaller standard 
deviation? 

d. What is that standard deviation? 


Solution: 


a. Yes 
b. Yes 
c. Yes 
d. 0.6 


Exercise: 


Problem: 


A researcher is looking at data from a large population with a standard 
deviation that is much too large. In order to concentrate the 
information, the researcher decides to repeatedly sample the data and 
use the distribution of the means of the samples. The first effort used 
samples of size 100. But the standard deviation was about double the 
value the researcher wanted. What is the smallest size samples the 
researcher can use to remedy the problem? 


Solution: 


400 
Exercise: 


Problem: 


A researcher looks at a large set of data, and concludes the population 
has a standard deviation of 40. Using sample sizes of 64, the 
researcher is able to focus the mean of the means of the sample to a 
narrower distribution where the standard deviation is 5. Then, the 
researcher realizes there was an error in the original calculations, and 
the initial standard deviation is really 20. Since the standard deviation 
of the means of the samples was obtained using the original standard 
deviation, this value is also impacted by the discovery of the error. 
What is the correct value of the standard deviation of the means of the 
samples? 


Solution: 


2.5 
Exercise: 


Problem: 


A population has a standard deviation of 50. It is sampled with 
samples of size 100. What is the variance of the means of the samples? 


Solution: 


25 


Glossary 


Mean 
a number that measures the central tendency; a common name for 
mean is "average." The term "mean" is a shortened form of "arithmetic 
mean." By definition, the mean for a sample (denoted by \(\overline{x}\)) is 
\[ \overline{x} = \frac{\text{Sum of all values in the sample}}{\text{Number of values in the sample}}, \] 
and the mean for a population (denoted by \(\mu\)) is 
\[ \mu = \frac{\text{Sum of all values in the population}}{\text{Number of values in the population}}. \] 


Finite Population Correction Factor 
adjusts the variance of the sampling distribution if the population size is 
known and more than 5% of the population is being sampled. 


Normal Distribution 
a continuous random variable with pdf 
\(f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\), 
where \(\mu\) is the mean of the distribution and \(\sigma\) is the standard 
deviation; notation: \(X \sim N(\mu, \sigma)\). If \(\mu = 0\) and 
\(\sigma = 1\), the random variable, Z, is called the standard normal 
distribution. 


Standard Error of the Proportion 
the standard deviation of the sampling distribution of proportions 


Introduction 


[Figure: Have you ever wondered what the average number of M&Ms in a bag at 
the grocery store is? You can use confidence intervals to answer this 
question. (credit: comedy_nose/flickr)] 

Suppose you were trying to determine the mean rent of a two-bedroom 
apartment in your town. You might look in the classified section of the 
newspaper, write down several rents listed, and average them together. You 
would have obtained a point estimate of the true mean. If you are trying to 
determine the percentage of times you make a basket when shooting a 
basketball, you might count the number of shots you make and divide that 
by the number of shots you attempted. In this case, you would have 
obtained a point estimate for the true proportion, the parameter p in the 
binomial probability density function. 


We use sample data to make generalizations about an unknown population. 
This part of statistics is called inferential statistics. The sample data help 
us to make an estimate of a population parameter. We realize that the 
point estimate is most likely not the exact value of the population 
parameter, but close to it. After calculating point estimates, we construct 
interval estimates, called confidence intervals. What statistics provides us 
beyond a simple average, or point estimate, is an estimate to which we can 
attach a probability of accuracy, what we will call a confidence level. We 
make inferences with a known level of probability. 


In this chapter, you will learn to construct and interpret confidence 
intervals. You will also learn a new distribution, the Student's-t, and how it 
is used with these intervals. Throughout the chapter, it is important to keep 
in mind that the confidence interval is a random variable. It is the 
population parameter that is fixed. 


If you worked in the marketing department of an entertainment company, 
you might be interested in the mean number of songs a consumer 
downloads a month from iTunes. If so, you could conduct a survey and 
calculate the sample mean, \(\overline{x}\), and the sample standard 
deviation, s. You would use \(\overline{x}\) to estimate the population mean 
and s to estimate the population standard deviation. The sample mean, 
\(\overline{x}\), is the point estimate for the population mean, \(\mu\). The 
sample standard deviation, s, is the point estimate for the population 
standard deviation, \(\sigma\). 


\(\overline{x}\) and s are each called a statistic. 


A confidence interval is another type of estimate but, instead of being just 
one number, it is an interval of numbers. The interval of numbers is a range 
of values calculated from a given set of sample data. The confidence 
interval is likely to include the unknown population parameter. 


Suppose, for the iTunes example, we do not know the population mean \(\mu\), 
but we do know that the population standard deviation is \(\sigma = 1\) and 
our sample size is 100. Then, by the central limit theorem, the standard 
deviation of the sampling distribution of the sample means is 

\[ \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{100}} = 0.1. \] 

The empirical rule, which applies to the normal distribution, says that in 
approximately 95% of the samples, the sample mean, \(\overline{x}\), will be 
within two standard deviations of the population mean \(\mu\). For our 
iTunes example, two standard deviations is (2)(0.1) = 0.2. The sample mean 
\(\overline{x}\) is likely to be within 0.2 units of \(\mu\). 


Because \(\overline{x}\) is within 0.2 units of \(\mu\), which is unknown, 
then \(\mu\) is likely to be within 0.2 units of \(\overline{x}\) with 95% 
probability. The population mean \(\mu\) is contained in an interval whose 
lower number is calculated by taking the sample mean and subtracting two 
standard deviations (2)(0.1) and whose upper number is calculated by taking 
the sample mean and adding two standard deviations. In other words, \(\mu\) 
is between \(\overline{x} - 0.2\) and \(\overline{x} + 0.2\) in 95% of all 
the samples. 


For the iTunes example, suppose that a sample produced a sample mean 
\(\overline{x} = 2\). Then with 95% probability the unknown population mean 
\(\mu\) is between 

\[ \overline{x} - 0.2 = 2 - 0.2 = 1.8 \quad \text{and} \quad \overline{x} + 0.2 = 2 + 0.2 = 2.2 \] 


We say that we are 95% confident that the unknown population mean 
number of songs downloaded from iTunes per month is between 1.8 and 
2.2. The 95% confidence interval is (1.8, 2.2). Please note that we talked 
in terms of 95% confidence using the empirical rule. The empirical rule for 
two standard deviations is only approximately 95% of the probability under 
the normal distribution. To be precise, two standard deviations under a 
normal distribution is actually 95.44% of the probability. To calculate the 
exact 95% confidence level we would use 1.96 standard deviations. 
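
The difference between the empirical rule's two standard deviations and the 
exact multiplier of 1.96 can be checked numerically. The sketch below uses 
`statistics.NormalDist` from Python's standard library; the helper name 
`central_coverage` is made up for this illustration. 

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def central_coverage(k):
    """Probability that a normal value falls within k standard
    deviations of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


print(round(central_coverage(2), 4))        # two standard deviations
print(round(central_coverage(1.96), 4))     # the multiplier quoted for 95%
print(round(std_normal.inv_cdf(0.975), 2))  # multiplier leaving 2.5% in each tail
```

The first value comes out slightly above 95%, in line with the figure quoted 
above up to rounding, while 1.96 gives 95% to four decimal places. 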


The 95% confidence interval implies two possibilities. Either the interval 
(1.8, 2.2) contains the true mean \(\mu\), or our sample produced an 
\(\overline{x}\) that is not within 0.2 units of the true mean \(\mu\). The 
second possibility happens for only 5% of all the samples (100% minus 95% = 
5%). 


Remember that a confidence interval is created for an unknown population 
parameter like the population mean, \(\mu\). 


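For the confidence interval for a mean the formula would be: 
Equation: 

\[ \mu = \overline{X} \pm Z_{\alpha}\left(\frac{\sigma}{\sqrt{n}}\right) \] 

Or written another way as: 
Equation: 

\[ \overline{X} - Z_{\alpha}\left(\frac{\sigma}{\sqrt{n}}\right) \le \mu \le \overline{X} + Z_{\alpha}\left(\frac{\sigma}{\sqrt{n}}\right) \] 


Where \(\overline{X}\) is the sample mean. \(Z_{\alpha}\) is determined by 
the level of confidence desired by the analyst, and 
\(\frac{\sigma}{\sqrt{n}}\) is the standard deviation of the sampling 
distribution for means given to us by the Central Limit Theorem. 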


Glossary 
Confidence Interval (CI) 
an interval estimate for an unknown population parameter. This 
depends on: 

• the desired confidence level, 
• information that is known about the distribution (for example, 
known standard deviation), 
• the sample and its size. 


Inferential Statistics 
also called statistical inference or inductive statistics; this facet of 
statistics deals with estimating a population parameter based on a 
sample statistic. For example, if four out of the 100 calculators 
sampled are defective we might infer that four percent of the 
production is defective. 


Parameter 
a numerical characteristic of a population 


Point Estimate 
a single number computed from a sample and used to estimate a 
population parameter 


A Confidence Interval for a Population Mean, Standard Deviation Known or 
Large Sample Size 


A confidence interval for a population mean with a known population 
standard deviation is based on the conclusion of the Central Limit Theorem 
that the sampling distribution of the sample means follows an approximately 
normal distribution. 


Calculating the Confidence Interval 


Consider the standardizing formula for the sampling distribution developed 
in the discussion of the Central Limit Theorem: 
Equation: 
$$Z_1 = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$


Notice that $\mu$ is substituted for $\mu_{\bar{X}}$ because we know from the 
Central Limit Theorem that $\mu_{\bar{X}}$, the expected value of $\bar{X}$, is 
$\mu$, and $\sigma_{\bar{X}}$ is replaced with $\sigma/\sqrt{n}$, also from the 
Central Limit Theorem. 


In this formula we know $\bar{X}$, $\sigma_{\bar{X}}$ and $n$, the sample size. 
(In actuality we do not know the population standard deviation, but we do have 
a point estimate for it, $s$, from the sample we took. More on this later.) 
What we do not know is $\mu$ or $Z_1$. We can solve for either one of these in 
terms of the other. Solving for $\mu$ in terms of $Z_1$ gives: 

Equation: 
$$\mu = \bar{X} - Z_1\left(\frac{\sigma}{\sqrt{n}}\right)$$


Remembering that the Central Limit Theorem tells us that the distribution 
of the $\bar{X}$'s, the sampling distribution for means, is normal, and that the 
normal distribution is symmetrical, we can rearrange terms thus: 


Equation: 
$$\bar{X} - Z_{\alpha}\left(\frac{\sigma}{\sqrt{n}}\right) \le \mu \le \bar{X} + Z_{\alpha}\left(\frac{\sigma}{\sqrt{n}}\right)$$


This is the formula for a confidence interval for the mean of a population. 


Notice that $Z_{\alpha}$ has been substituted for $Z_1$ in this equation. This is 
where a choice must be made by the statistician. The analyst must decide the 
level of confidence they wish to impose on the confidence interval. $\alpha$ is 
the probability that the interval will not contain the true population mean. 
The confidence level is defined as $(1 - \alpha)$. $Z_{\alpha}$ is the number of 
standard deviations $\bar{X}$ lies from the mean with a certain probability. If 
we choose $Z_{\alpha} = 1.96$ we are asking for the 95% confidence interval 
because we are setting the probability that the true mean lies within the range 
at 0.95. If we set $Z_{\alpha}$ at 1.645 we are asking for the 90% confidence 
interval because we have set the probability at 0.90. These numbers can be 
verified by consulting the Standard Normal table. Divide either 0.95 or 0.90 in 
half and find that probability inside the body of the table. Then read on the 
top and left margins the number of standard deviations it takes to get this 
level of probability. 
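The table lookup described above can be mirrored in code. The sketch below 
assumes Python's standard `statistics.NormalDist`; the helper name 
`z_for_confidence` is hypothetical and is introduced only for illustration. 

```python
from statistics import NormalDist

def z_for_confidence(confidence_level):
    """Return z such that the central area under N(0, 1) equals confidence_level."""
    # Half of the confidence level is the area between 0 and z,
    # which is the value looked up in the body of the table.
    half_area = confidence_level / 2
    # inv_cdf expects the total area to the LEFT of z: 0.5 plus the half area.
    return NormalDist().inv_cdf(0.5 + half_area)

for cl in (0.90, 0.95, 0.99):
    print(f"confidence level {cl:.2f} -> z = {z_for_confidence(cl):.3f}")
```

Any other confidence level can be passed in the same way, which reflects the 
point that the multiplier is the analyst's choice. 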


In reality, we can set whatever level of confidence we desire simply by 
changing the $Z_{\alpha}$ value in the formula. It is the analyst's choice. 
Common convention in Economics and most social sciences sets confidence 
intervals at either 90, 95, or 99 percent levels. Levels less than 90% are 
considered of little value. The level of confidence of a particular interval 
estimate is denoted by $(1 - \alpha)$. 


A good way to see the development of a confidence interval is to 
graphically depict the solution to a problem requesting a confidence 
interval. This is presented in [link] for the example in the introduction 
concerning the number of downloads from iTunes. That case was for a 95% 
confidence interval, but other levels of confidence could have just as easily 
been chosen depending on the need of the analyst. However, the level of 
confidence MUST be pre-set and not subject to revision as a result of the 
calculations. 


[Figure: normal curve centered at $\bar{x} = 10$ with EBM = 5; the interval runs 
from $\bar{x}$ - EBM = 5 to $\bar{x}$ + EBM = 15; Confidence Level (CL) = 0.90.] 


For this example, let's say we know that the actual population mean number 
of iTunes downloads is 2.1. The true population mean falls within the range 
of the 95% confidence interval. There is absolutely nothing to guarantee 
that this will happen. Further, if the true mean falls outside of the 
interval we will never know it. We must always remember that we will 
never ever know the true mean. Statistics simply allows us, with a given 
level of probability (confidence), to say that the true mean is within the 
range calculated. This is what was called in the introduction, the "level of 
ignorance admitted". 


Changing the Confidence Level or Sample Size 


Here again is the formula for a confidence interval for an unknown 
population mean assuming we know the population standard deviation: 


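Equation: 
$$\bar{X} - Z_{\alpha}\left(\frac{\sigma}{\sqrt{n}}\right) \le \mu \le \bar{X} + Z_{\alpha}\left(\frac{\sigma}{\sqrt{n}}\right)$$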


It is clear that the confidence interval is driven by two things, the chosen 
level of confidence, $Z_{\alpha}$, and the standard deviation of the sampling 
distribution. The standard deviation of the sampling distribution is further 
affected by two things, the standard deviation of the population and the 
sample size we chose for our data. Here we wish to examine the effects of 
each of the choices we have made on the calculated confidence interval, the 
confidence level and the sample size. 


For a moment we should ask just what we desire in a confidence interval. 
Our goal was to estimate the population mean from a sample. We have 
forsaken the hope that we will ever find the true population mean, and 
population standard deviation for that matter, for any case except where we 
have an extremely small population and the cost of gathering the data of 
interest is very small. In all other cases we must rely on samples. With the 
Central Limit Theorem we have the tools to provide a meaningful 
confidence interval with a given level of confidence, meaning a known 
probability of being wrong. By meaningful confidence interval we mean 
one that is useful. Imagine that you are asked for a confidence interval for 
the ages of your classmates. You have taken a sample and find a mean of 
19.8 years. You wish to be very confident so you report an interval between 
9.8 years and 29.8 years. This interval would certainly contain the true 
population mean and have a very high confidence level. However, it hardly 
qualifies as meaningful. The very best confidence interval is narrow while 
having high confidence. There is a natural tension between these two goals. 
The higher the level of confidence, the wider the confidence interval, as in 
the case of the students' ages above. We can see this tension in the equation 
for the confidence interval. 

Equation: 
$$\mu = \bar{x} \pm Z_{\alpha}\left(\frac{\sigma}{\sqrt{n}}\right)$$


The confidence interval will increase in width as $Z_{\alpha}$ increases, and 
$Z_{\alpha}$ increases as the level of confidence increases. There is a tradeoff 
between the level of confidence and the width of the interval. Now let's look 
at the formula again and we see that the sample size also plays an important 
role in the width of the confidence interval. The sample size, $n$, shows up in 
the denominator of the standard deviation of the sampling distribution. As 
the sample size increases, the standard deviation of the sampling 
distribution decreases, and so does the width of the confidence interval, 
holding the level of confidence constant. This relationship was 
demonstrated in [link]. Again we see the importance of having large 
samples for our analysis although we then face a second constraint, the cost 
of gathering data. 
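Both effects can be seen side by side by tabulating the interval width for a 
few confidence levels and sample sizes. This is a minimal sketch: the 
population standard deviation of 3 is an assumed illustrative value, and the 
function name `interval_width` is hypothetical. 

```python
from math import sqrt
from statistics import NormalDist

def interval_width(sigma, n, confidence_level):
    """Width of the interval x_bar +/- z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    return 2 * z * sigma / sqrt(n)

sigma = 3.0  # assumed population standard deviation (illustrative only)

for cl in (0.90, 0.95, 0.99):
    for n in (25, 100, 400):
        width = interval_width(sigma, n, cl)
        print(f"CL = {cl:.2f}  n = {n:>3}  width = {width:.3f}")
```

Reading down the output, the width grows with the confidence level and shrinks 
with the sample size; because $n$ enters through a square root, quadrupling the 
sample size halves the width. 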


Calculating the Confidence Interval: An Alternative Approach 


Another way to approach confidence intervals is through the use of 
something called the Error Bound. The Error Bound gets its name from the 
recognition that it provides the boundary of the interval derived from the 
standard error of the sampling distribution. In the equations above it is seen 
that the interval is simply the estimated mean, sample mean, plus or minus 
something. That something is the Error Bound and is driven by the 
probability we desire to maintain in our estimate, $Z_{\alpha}$, times the 
standard deviation of the sampling distribution. The Error Bound for a mean is 
given the name Error Bound Mean, or EBM. 


To construct a confidence interval for a single unknown population mean $\mu$, 
where the population standard deviation is known, we need $\bar{x}$ as an 
estimate for $\mu$ and we need the margin of error. Here, the margin of error 
is called the error bound for a population mean (abbreviated EBM). 


The sample mean $\bar{x}$ is the point estimate of the unknown population mean 
$\mu$. 


The confidence interval estimate will have the form: 


(point estimate - error bound, point estimate + error bound) or, in symbols, 
($\bar{x}$ - EBM, $\bar{x}$ + EBM) 


The mathematical formula for this confidence interval is: 


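Equation: 
$$\bar{x} - Z_{\alpha}\left(\frac{\sigma}{\sqrt{n}}\right) \le \mu \le \bar{x} + Z_{\alpha}\left(\frac{\sigma}{\sqrt{n}}\right)$$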


The margin of error (EBM) depends on the confidence level (abbreviated 
CL). The confidence level is often considered the probability that the 
calculated confidence interval estimate will contain the true population 
parameter. However, it is more accurate to state that the confidence level is 
the percent of confidence intervals that contain the true population 
parameter when repeated samples are taken. Most often, it is the choice of 
the person constructing the confidence interval to choose a confidence level 
of 90% or higher because that person wants to be reasonably certain of his 
or her conclusions. 


There is another probability called alpha ($\alpha$). $\alpha$ is related to the 
confidence level, CL. $\alpha$ is the probability that the interval does not 
contain the unknown population parameter. 

Mathematically, 1 - $\alpha$ = CL. 


A confidence interval for a population mean with a known standard 
deviation is based on the fact that the sampling distribution of the sample 
means follows an approximately normal distribution. Suppose that our 
sample has a mean of $\bar{x} = 10$, and we have constructed the 90% confidence 
interval (5, 15) where EBM = 5. 


To get a 90% confidence interval, we must include the central 90% of the 
probability of the normal distribution. If we include the central 90%, we 
leave out a total of $\alpha$ = 10% in both tails, or 5% in each tail, of the 
normal distribution. 


[Figure: normal curve centered at $\bar{x} = 10$ with EBM = 5; the interval runs 
from $\bar{x}$ - EBM = 5 to $\bar{x}$ + EBM = 15; Confidence Level (CL) = 0.90.] 


To capture the central 90%, we must go out 1.645 standard deviations on 
either side of the calculated sample mean. The value 1.645 is the z-score 
from a standard normal probability distribution that puts an area of 0.90 in 
the center, an area of 0.05 in the far left tail, and an area of 0.05 in the far 
right tail. 


It is important that the standard deviation used be appropriate for the 
parameter we are estimating, so in this section we need to use the standard 
deviation that applies to the sampling distribution for means, which we 
studied with the Central Limit Theorem and is $\sigma/\sqrt{n}$. 


Calculating the Confidence Interval Using EBM 


To construct a confidence interval estimate for an unknown population 
mean, we need data from a random sample. The steps to construct and 
interpret the confidence interval are: 


• Calculate the sample mean $\bar{x}$ from the sample data. Remember, in this 
section we know the population standard deviation $\sigma$. 

• Find the z-score from the standard normal table that corresponds to the 
confidence level desired. 

• Calculate the error bound EBM. 

• Construct the confidence interval. 

• Write a sentence that interprets the estimate in the context of the 
situation in the problem. 
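The steps above can be sketched as one small function. This is an illustration 
under stated assumptions, not a prescribed implementation: the function name 
`mean_confidence_interval` is hypothetical, the eight scores are made-up data, 
and the population standard deviation is assumed known, as in this section. 

```python
from math import sqrt
from statistics import NormalDist, mean

def mean_confidence_interval(sample, sigma, confidence_level):
    """Confidence interval for a mean when the population sigma is known."""
    x_bar = mean(sample)                       # step 1: sample mean
    alpha = 1 - confidence_level
    z = NormalDist().inv_cdf(1 - alpha / 2)    # step 2: z-score for the level
    ebm = z * sigma / sqrt(len(sample))        # step 3: error bound (EBM)
    return x_bar - ebm, x_bar + ebm            # step 4: the interval

# Made-up exam scores, used only to exercise the function.
sample = [66, 70, 68, 71, 65, 69, 67, 68]
low, high = mean_confidence_interval(sample, sigma=3.0, confidence_level=0.90)

# Step 5: interpret the estimate in a sentence.
print(f"We estimate with 90% confidence that the true mean is between "
      f"{low:.2f} and {high:.2f}.")
```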


We will first examine each step in more detail, and then illustrate the 
process with some examples. 


Finding the z-score for the Stated Confidence Level 


When we know the population standard deviation $\sigma$, we use a standard 
normal distribution to calculate the error bound EBM and construct the 
confidence interval. We need to find the value of z that puts an area equal to 
the confidence level (in decimal form) in the middle of the standard normal 
distribution Z ~ N(0, 1). 


The confidence level, CL, is the area in the middle of the standard normal 
distribution. CL = 1 - $\alpha$, so $\alpha$ is the area that is split equally 
between the two tails. Each of the tails contains an area equal to $\alpha/2$. 


The z-score that has an area to the right of $\alpha/2$ is denoted by 
$Z_{\alpha/2}$. 


For example, when CL = 0.95, $\alpha$ = 0.05 and $\alpha/2$ = 0.025; we write 
$Z_{\alpha/2} = Z_{0.025}$. 


The area to the right of $Z_{0.025}$ is 0.025 and the area to the left of 
$Z_{0.025}$ is 1 - 0.025 = 0.975. 


$Z_{\alpha/2} = Z_{0.025} = 1.96$, using a standard normal probability table. We 
will see later that we can use a different probability table, the Student's 
t-distribution, for finding the number of standard deviations of commonly 
used levels of confidence. 


Calculating the Error Bound (EBM) 


The error bound formula for an unknown population mean $\mu$ when the 
population standard deviation $\sigma$ is known is 


• EBM = $\left(Z_{\alpha/2}\right)\left(\frac{\sigma}{\sqrt{n}}\right)$ 


Constructing the Confidence Interval 


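• The confidence interval estimate has the format 
($\bar{x}$ - EBM, $\bar{x}$ + EBM) or the formula: 


$$\bar{X} - Z_{\alpha}\left(\frac{\sigma}{\sqrt{n}}\right) \le \mu \le \bar{X} + Z_{\alpha}\left(\frac{\sigma}{\sqrt{n}}\right)$$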


The graph gives a picture of the entire situation. 


[Figure: normal curve with central area CL = 1 - $\alpha$ between $\bar{x}$ - EBM 
and $\bar{x}$ + EBM, and area $\alpha/2$ in each tail.] 


Example: 
Suppose we are interested in the mean scores on an exam. A random 
sample of 36 scores is taken and gives a sample mean score 
of 68 ($\bar{X} = 68$). In this example we have the unusual knowledge that the 
population standard deviation is 3 points. Do not count on knowing the 
population parameters outside of textbook examples. Find a confidence 
interval estimate for the population mean exam score (the mean score on 
all exams). 

Exercise: 


Problem: 


Find a 90% confidence interval for the true (population) mean of 
statistics exam scores. 


Solution: 


• The solution is shown step-by-step. 


To find the confidence interval, you need the sample mean, $\bar{x}$, and the 
EBM. 


• $\bar{x} = 68$ 
• EBM = $\left(Z_{\alpha/2}\right)\left(\frac{\sigma}{\sqrt{n}}\right)$ 
• $\sigma = 3$; $n = 36$; The confidence level is 90% (CL = 0.90) 


CL = 0.90 so $\alpha$ = 1 - CL = 1 - 0.90 = 0.10 
$\alpha/2$ = 0.05, so $Z_{\alpha/2} = Z_{0.05}$ 


The area to the right of $Z_{0.05}$ is 0.05 and the area to the left of 
$Z_{0.05}$ is 1 - 0.05 = 0.95. 


$Z_{\alpha/2} = Z_{0.05} = 1.645$ 


This can be found using a computer, or using a probability table for 
the standard normal distribution. Because the common levels of 
confidence in the social sciences are 90%, 95% and 99% it will not be 
long until you become familiar with the numbers 1.645, 1.96, and 
2.576. 


EBM = $(1.645)\left(\frac{3}{\sqrt{36}}\right)$ = 0.8225 


$\bar{x}$ - EBM = 68 - 0.8225 = 67.1775 


$\bar{x}$ + EBM = 68 + 0.8225 = 68.8225 
The 90% confidence interval is (67.1775, 68.8225). 


Interpretation 
We estimate with 90% confidence that the true population mean exam 
score for all statistics students is between 67.18 and 68.82. 
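The arithmetic in this solution can be reproduced in a few lines. The sketch 
below uses the values given in the example; rounding the z-score to three 
decimals is an assumption made here so that the result matches the table value 
1.645 used above. 

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n, cl = 68, 3, 36, 0.90    # values given in the example

alpha = 1 - cl
z = NormalDist().inv_cdf(1 - alpha / 2)  # area to the left of the z-score is 0.95
z_table = round(z, 3)                    # rounded as it would be read from a table

ebm = z_table * sigma / sqrt(n)          # error bound for the mean
lower, upper = x_bar - ebm, x_bar + ebm

print(f"z = {z_table}, EBM = {ebm:.4f}")
print(f"90% confidence interval: ({lower:.4f}, {upper:.4f})")
```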


Example: 
Exercise: 


Problem: 
Suppose we change the original problem in [link] by using a 95% 


confidence level. Find a 95% confidence interval for the true 
(population) mean statistics exam score. 


Solution: 


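[Figure: standard normal curve with central area 0.95 between 
$-Z_{0.025} = -1.96$ and $Z_{0.025} = 1.96$, and area 0.025 in each tail.] 


Equation: 
$$\mu = \bar{x} \pm Z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$$


Equation: 
$$\mu = 68 \pm 1.96\left(\frac{3}{\sqrt{36}}\right)$$


Equation: 
$$67.02 \le \mu \le 68.98$$


$\sigma = 3$; $n = 36$; The confidence level is 95% (CL = 0.95). 
CL = 0.95 so $\alpha$ = 1 - CL = 1 - 0.95 = 0.05 
$Z_{\alpha/2} = Z_{0.025} = 1.96$ 


Notice that the EBM is larger for a 95% confidence level than for the 90% 
confidence level in the original problem. 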


Comparing the results 

The 90% confidence interval is (67.18, 68.82). The 95% confidence 
interval is (67.02, 68.98). The 95% confidence interval is wider. If 
you look at the graphs, because the area 0.95 is larger than the area 
0.90, it makes sense that the 95% confidence interval is wider. To be 
more confident that the confidence interval actually does contain the 
true value of the population mean for all statistics exam scores, the 
confidence interval necessarily needs to be wider. This demonstrates a 
very important principle of confidence intervals. There is a trade-off 
between the level of confidence and the width of the interval. Our 
desire is to have a narrow confidence interval, because very wide intervals 
provide little useful information. But we would also like to 
have a high level of confidence in our interval. This demonstrates that 
we cannot have both. 

[Figure (b): normal curve with central area 0.95 and area 0.025 in each tail.] 


Summary: Effect of Changing the Confidence Level 


• Increasing the confidence level makes the confidence interval 
wider. 

• Decreasing the confidence level makes the confidence interval 
narrower. 


And again here is the formula for a confidence interval for an unknown 
mean assuming we have the population standard deviation: 


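Equation: 
$$\bar{X} - Z_{\alpha}\left(\frac{\sigma}{\sqrt{n}}\right) \le \mu \le \bar{X} + Z_{\alpha}\left(\frac{\sigma}{\sqrt{n}}\right)$$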


The standard deviation of the sampling distribution was provided by the 
Central Limit Theorem as $\sigma/\sqrt{n}$. While we infrequently get to choose 
the sample size, it plays an important role in the confidence interval. Because 
the sample size is in the denominator of the equation, as $n$ increases it 
causes the standard deviation of the sampling distribution to decrease and 
thus the width of the confidence interval to decrease. We have met this 
before as we reviewed the effects of sample size on the Central Limit 
Theorem. There we saw that as $n$ increases the sampling distribution 
narrows until in the limit it collapses on the true population mean. 


Example: 

Suppose we change the original problem in [link] to see what happens to 
the confidence interval if the sample size is changed. 

Exercise: 


Problem: 
Leave everything the same except the sample size. Use the original 
90% confidence level. What happens to the confidence interval if we 
increase the sample size and use n = 100 instead of n = 36? What 
happens if we decrease the sample size to n = 25 instead of n = 36? 


Solution: 
Solution A 
$$\mu = \bar{x} \pm Z_{\alpha}\left(\frac{\sigma}{\sqrt{n}}\right)$$

$$\mu = 68 \pm 1.645\left(\frac{3}{\sqrt{100}}\right)$$

$$67.5065 \le \mu \le 68.4935$$

If we increase the sample size n to 100, we decrease the width of the 
confidence interval relative to the original sample size of 36 
observations. 


Solution: 


Solution B 

$$\mu = \bar{x} \pm Z_{\alpha}\left(\frac{\sigma}{\sqrt{n}}\right)$$

$$\mu = 68 \pm 1.645\left(\frac{3}{\sqrt{25}}\right)$$

$$67.013 \le \mu \le 68.987$$

If we decrease the sample size n to 25, we increase the width of the 
confidence interval by comparison to the original sample size of 36 
observations. 


Summary: Effect of Changing the Sample Size 


• Increasing the sample size makes the confidence interval narrower. 
• Decreasing the sample size makes the confidence interval wider. 
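The three sample sizes from this example can be compared in one short loop. 
The sketch below takes the sample mean of 68, the standard deviation of 3 and 
the 90% table value 1.645 from the example; the dictionary name `intervals` is 
illustrative only. 

```python
from math import sqrt

x_bar, sigma = 68, 3   # sample mean and population standard deviation from the example
z_90 = 1.645           # table z-score for a 90% confidence level

intervals = {}
for n in (25, 36, 100):
    ebm = z_90 * sigma / sqrt(n)              # error bound shrinks as n grows
    intervals[n] = (x_bar - ebm, x_bar + ebm)
    low, high = intervals[n]
    print(f"n = {n:>3}: EBM = {ebm:.4f}, interval = ({low:.4f}, {high:.4f})")
```

The intervals for n = 100 and n = 25 agree with Solution A and Solution B, and 
the interval for n = 36 is the original 90% interval. 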


We have already seen this effect when we reviewed the effects of changing 
the size of the sample, n, on the Central Limit Theorem. See [link] to see 
this effect. There we saw that as the sample size increased the standard 
deviation of the sampling distribution decreased. This is why we prefer 
the sample mean from a large sample to that from a small sample, all 
other things held constant. 


Thus far we assumed that we knew the population standard deviation. This 
will virtually never be the case. We will have the sample standard deviation, 
s, however. This is a point estimate for the population standard deviation 
and can be substituted into the formula for confidence intervals for a mean 
under certain circumstances. We just saw the effect the sample size has on 
the width of confidence interval and the impact on the sampling distribution 
for our discussion of the Central Limit Theorem. We can invoke this to 
substitute the point estimate for the standard deviation if the sample size is 
large "enough". Simulation studies indicate that 30 observations or more 
will be sufficient to eliminate any meaningful bias in the estimated 
confidence interval. 


Example: 

Spring break can be a very expensive holiday. A sample of 80 students is 
surveyed, and the average amount spent by students on travel and 
beverages is $593.84. The sample standard deviation is approximately 
$369.34. 

Exercise: 


Problem: 


Construct a 92% confidence interval for the population mean amount 
of money spent by spring breakers. 


Solution: 


We begin with the confidence interval for a mean. We use the formula 
for a mean because the random variable is dollars spent and this is a 
continuous random variable. The point estimate for the population 
standard deviation, s, has been substituted for the true population 
standard deviation because with 80 observations there is no concern 
for bias in the estimate of the confidence interval. 

Equation: 
$$\mu = \bar{x} \pm Z_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$$


Substituting the values into the formula, we have: 
Equation: 
$$\mu = 593.84 \pm 1.75\left(\frac{369.34}{\sqrt{80}}\right)$$


$Z_{\alpha/2}$ is found on the standard normal table by looking up 0.46 in the 
body of the table and finding the number of standard deviations on the 
side and top of the table: 1.75. The solution for the interval is thus: 
Equation: 
$$\mu = 593.84 \pm 72.2636 = (521.58, 666.10)$$
Equation: 
$$\$521.58 \le \mu \le \$666.10$$
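The spring break calculation can be checked as follows. This sketch uses the 
figures from the example; it also computes the z-score directly with 
`statistics.NormalDist` to show how close the table value of 1.75 is. The 
variable names are illustrative only. 

```python
from math import sqrt
from statistics import NormalDist

x_bar, s, n, cl = 593.84, 369.34, 80, 0.92   # values given in the example

# Exact z-score for a central area of 0.92 (area to the left is 0.96).
z_exact = NormalDist().inv_cdf(0.5 + cl / 2)
# Value read from the standard normal table in the text.
z_table = 1.75

# With n = 80 the sample standard deviation s stands in for sigma.
ebm = z_table * s / sqrt(n)
lower, upper = x_bar - ebm, x_bar + ebm

print(f"exact z = {z_exact:.4f}, table z = {z_table}")
print(f"EBM = {ebm:.4f}")
print(f"92% confidence interval: (${lower:.2f}, ${upper:.2f})")
```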


[Figure: normal curve centered at $\bar{x}$ = $593.84 with interval endpoints 
$521.58 and $666.10.] 

Formula Review 


The general form for a confidence interval for a single population mean, 
known standard deviation, normal distribution is given by 


$\bar{x} - Z_{\alpha}\left(\frac{\sigma}{\sqrt{n}}\right) \le \mu \le \bar{x} + Z_{\alpha}\left(\frac{\sigma}{\sqrt{n}}\right)$ 
This formula is used when the population standard deviation is known. 


CL = confidence level, or the proportion of confidence intervals created that 
are expected to contain the true population parameter 


$\alpha$ = 1 - CL = the proportion of confidence intervals that will not contain 
the population parameter 


$Z_{\alpha}$ = the z-score with the property that the area to the right of the 
z-score is $\alpha/2$; this is the z-score used in the calculation of EBM, where 
$\alpha$ = 1 - CL. 


Glossary 


Confidence Level (CL) 
the percent expression for the probability that the confidence interval 
contains the true population parameter; for example, if the CL = 90%, 
then in 90 out of 100 samples the interval estimate will enclose the true 
population parameter. 


Error Bound for a Population Mean (EBM) 
the margin of error; depends on the confidence level, sample size, and 
known or estimated population standard deviation. 


A Confidence Interval for a Population Mean, Standard Deviation Unknown, Small Sample Case 


In practice, we rarely know the population standard deviation. In the past, when the sample 
size was large, this did not present a problem to statisticians. They used the sample standard 
deviation $s$ as an estimate for $\sigma$ and proceeded as before to calculate a confidence 
interval with close enough results. This is what we did in [link] above. The point estimate for 
the standard deviation, $s$, was substituted in the formula for the confidence interval for the 
population standard deviation. In that case there were 80 observations, well above the 
suggested 30 observations needed to eliminate any bias from a small sample. However, 
statisticians ran into problems when the sample size was small. A small sample size caused 
inaccuracies in the confidence interval. 


William S. Gosset (1876-1937) of the Guinness brewery in Dublin, Ireland ran into this 
problem. His experiments with hops and barley produced very few samples. Just replacing 
$\sigma$ with $s$ did not produce accurate results when he tried to calculate a confidence 
interval. He realized that he could not use a normal distribution for the calculation; he found 
that the actual distribution depends on the sample size. This problem led him to "discover" 
what is called the Student's t-distribution. The name comes from the fact that Gosset wrote 
under the pen name "Student." 


Up until the mid-1970s, some statisticians used the normal distribution approximation for 
large sample sizes and used the Student's t-distribution only for sample sizes of at most 30 
observations. 


If you draw a simple random sample of size $n$ from a population with mean $\mu$ and unknown 
population standard deviation $\sigma$ and calculate the t-score 
$t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$, then the t-scores follow a 
Student's t-distribution with $n - 1$ degrees of freedom. The t-score has the same 
interpretation as the z-score. It measures how far in standard deviation units $\bar{x}$ is 
from its mean $\mu$. For each sample size $n$, there is a different Student's t-distribution. 


The degrees of freedom, $n - 1$, come from the calculation of the sample standard deviation $s$. 
Remember when we first calculated a sample standard deviation we divided the sum of the 
squared deviations by $n - 1$, but we used $n$ deviations ($x - \bar{x}$ values) to calculate 
$s$. Because the sum of the deviations is zero, we can find the last deviation once we know the 
other $n - 1$ deviations. The other $n - 1$ deviations can change or vary freely. We call the 
number $n - 1$ the degrees of freedom (df) in recognition that one is lost in the calculations. 
The effect of losing a degree of freedom is that the t-value increases and the confidence 
interval increases in width. 
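The claim that the last deviation is fixed by the others can be seen with a tiny 
numerical example. The five data values below are made up for illustration, and 
the variable names are hypothetical. 

```python
from math import sqrt
from statistics import mean, stdev

sample = [6.0, 4.0, 7.0, 5.0, 8.0]      # made-up data, n = 5
x_bar = mean(sample)
deviations = [x - x_bar for x in sample]

# The deviations sum to zero, so the last one is determined by the first n - 1.
last_from_others = -sum(deviations[:-1])
print("sum of deviations:", sum(deviations))
print("last deviation:", deviations[-1], "recovered from the others:", last_from_others)

# Sample standard deviation: squared deviations divided by n - 1, not n.
s_manual = sqrt(sum(d * d for d in deviations) / (len(sample) - 1))
print("s by hand:", round(s_manual, 4), "statistics.stdev:", round(stdev(sample), 4))
```

Only four of the five deviations are free to vary here, which is the sense in 
which this sample has $n - 1 = 4$ degrees of freedom. 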

Properties of the Student's t-Distribution 


• The graph for the Student's t-distribution is similar to the standard normal curve and at 
infinite degrees of freedom it is the normal distribution. You can confirm this by reading 
the bottom line at infinite degrees of freedom for a familiar level of confidence, e.g. at 
column 0.05, 95% level of confidence, we find the t-value of 1.96 at infinite degrees of 
freedom. 

• The mean for the Student's t-distribution is zero and the distribution is symmetric about 
zero, again like the standard normal distribution. 

• The Student's t-distribution has more probability in its tails than the standard normal 
distribution because the spread of the t-distribution is greater than the spread of the 
standard normal. So the graph of the Student's t-distribution will be thicker in the tails and 
shorter in the center than the graph of the standard normal distribution. 

• The exact shape of the Student's t-distribution depends on the degrees of freedom. As the 
degrees of freedom increase, the graph of Student's t-distribution becomes more like the 
graph of the standard normal distribution. 

• The underlying population of individual observations is assumed to be normally 
distributed with unknown population mean $\mu$ and unknown population standard deviation 
$\sigma$. This assumption comes from the Central Limit Theorem because the individual 
observations in this case are the $\bar{x}$s of the sampling distribution. The size of the 
underlying population is generally not relevant unless it is very small. If it is normal then 
the assumption is met and doesn't need discussion. 


A probability table for the Student's t-distribution is used to calculate t-values at various 
commonly-used levels of confidence. The table gives t-scores that correspond to the 
confidence level (column) and degrees of freedom (row). When using a t-table, note that some 
tables are formatted to show the confidence level in the column headings, while the column 
headings in some tables may show only corresponding area in one or both tails. Notice that at 
the bottom the table will show the t-value for infinite degrees of freedom. Mathematically, as 
the degrees of freedom increase, the t distribution approaches the standard normal distribution. 
You can find familiar Z-values by looking in the relevant alpha column and reading the value in 
the last row. 


A Student's t table (See [link]) gives t-scores given the degrees of freedom and the right-tailed 
probability. 


The Student's t distribution has one of the most desirable properties of the normal: it is 
symmetrical. What the Student's t distribution does is spread out the horizontal axis so it takes 
a larger number of standard deviations to capture the same amount of probability. In reality 
there are an infinite number of Student's t distributions, one for each adjustment to the sample 
size. As the sample size increases, the Student's t distribution becomes more and more like the 
normal distribution. When the sample size reaches 30 the normal distribution is usually 
substituted for the Student's t because they are so much alike. This relationship between the 
Student's t distribution and the normal distribution is shown in [link]. 


[Figure: the Student's t distribution compared with the normal distribution.] 


This is another example of one distribution limiting another one; in this case the normal 
distribution is the limiting distribution of the Student's t when the degrees of freedom in the 
Student's t approach infinity. This conclusion comes directly from the derivation of the 
Student's t distribution by Mr. Gosset. He recognized the problem as having few observations 
and no estimate of the population standard deviation. He was substituting the sample standard 
deviation and getting volatile results. He therefore created the Student's t distribution as a ratio 
of the normal distribution and Chi squared distribution. The Chi squared distribution is itself a 
ratio of two variances, in this case the sample variance and the unknown population variance. 
The Student's t distribution thus is tied to the normal distribution, but has degrees of freedom 
that come from those of the Chi squared distribution. The algebraic solution demonstrates this 
result. 

Development of Student's t-distribution: 


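1. $t = \dfrac{Z}{\sqrt{\chi^2 / \nu}}$ 


Where $Z$ is the standard normal distribution and $\chi^2$ is the chi-squared distribution 
with $\nu$ degrees of freedom. 


2. $t = \dfrac{\dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\dfrac{(n-1)s^2/\sigma^2}{n-1}}}$ 


by substitution, and thus Student's t with $\nu = n - 1$ degrees of freedom is: 

3. $t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$ 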


Restating the formula for a confidence interval for the mean for cases when the sample size is 
smaller than 30 and we do not know the population standard deviation, o: 


Equation: 
$$\bar{x} - t_{\nu,\alpha}\left(\frac{s}{\sqrt{n}}\right) \le \mu \le \bar{x} + t_{\nu,\alpha}\left(\frac{s}{\sqrt{n}}\right)$$


Here the point estimate of the population standard deviation, $s$, has been substituted for the 
population standard deviation, $\sigma$, and $t_{\nu,\alpha}$ has been substituted for 
$Z_{\alpha}$. The Greek letter $\nu$ (pronounced nu) is placed in the general formula in 
recognition that there are many Student $t_{\nu}$ distributions, one for each sample size. 
$\nu$ is the symbol for the degrees of freedom of the distribution and depends on the size of 
the sample. Often df is used to abbreviate degrees of freedom. For this type of problem, the 
degrees of freedom is $\nu = n - 1$, where $n$ is the sample size. To look up a probability in 
the Student's t table we have to know the degrees of freedom in the problem. 


Example: 
Exercise: 


Problem: 


The average earnings per share (EPS) for 10 industrial stocks randomly selected from 
those listed on the Dow-Jones Industrial Average was found to be $\bar{X} = 1.85$ with a 
standard deviation of $s = 0.395$. Calculate a 99% confidence interval for the average EPS 
of all the industrials listed on the DJIA. 

Equation: 


Solution: 


To help visualize the process of calculating a confidence interval we draw the appropriate 
distribution for the problem. In this case this is the Student’s t because we do not know 
the population standard deviation and the sample is small, less than 30. 


[Figure: Student’s t distribution centered at $\bar{X} = 1.85$ with interval endpoints 1.44 
and 2.26.] 


Finding the appropriate t-value requires two pieces of information, the level of 
confidence desired and the degrees of freedom. The question asked for a 99% confidence 
level. On the graph this is shown where $(1 - \alpha)$, the level of confidence, is in the 
unshaded area. The tails, thus, have 0.005 probability each, $\alpha/2$. The degrees of 
freedom for this type of problem is $n - 1 = 9$. From the Student’s t table, at the row marked 
9 and column marked 0.005, is the number of standard deviations to capture 99% of the 
probability, 3.2498. These are then placed on the graph remembering that the Student’s t 
is symmetrical and so the t-value is both plus and minus on each side of the mean. 


Inserting these values into the formula gives the result. These values can be placed on the
graph to see the relationship between the distribution of the sample means, x̄'s, and the
Student’s t distribution.
Equation:


$\mu = \bar{x} \pm t_{\alpha/2,\,df=n-1}\,\dfrac{s}{\sqrt{n}} = 1.851 \pm 3.2498\,\dfrac{0.395}{\sqrt{10}} = 1.851 \pm 0.406$


Equation:


$1.445 < \mu < 2.257$


We state the formal conclusion as : 


At the 99% confidence level, the average EPS of all the industrials listed on the DJIA is between
$1.44 and $2.26.
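

The arithmetic in this example can be checked with a few lines of code. The following is a minimal Python sketch and is not part of the original text: the function name t_interval is an illustrative assumption, and the critical value 3.2498 is simply copied from the Student’s t table (row 9, column .005) rather than computed. It uses the rounded sample mean 1.85 from the problem statement, so the limits can differ from the displayed equation in the third decimal place.

```python
import math

def t_interval(mean, s, n, t_crit):
    """Return (lower, upper, margin) for mean ± t_crit * s / sqrt(n).

    t_crit is the table value for the chosen confidence level and n - 1
    degrees of freedom; it is supplied by the caller, not computed here.
    """
    margin = t_crit * s / math.sqrt(n)
    return mean - margin, mean + margin, margin

# EPS example: n = 10 stocks, s = 0.395, 99% confidence, df = 9
lower, upper, margin = t_interval(mean=1.85, s=0.395, n=10, t_crit=3.2498)
print(f"margin = {margin:.3f}")
print(f"interval = ({lower:.2f}, {upper:.2f})")
```

Rounded to two decimals, the limits agree with the $1.44 and $2.26 stated in the conclusion.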


Note: 
Try It 
Exercise: 


Problem: 


You do a study of hypnotherapy to determine how effective it is in increasing the number 
of hours of sleep subjects get each night. You measure hours of sleep for 12 subjects with 
the following results. Construct a 95% confidence interval for the mean number of hours 
slept for the population (assumed normal) from which you took the data. 


8.2; 9.1; 7.7; 8.6; 6.9; 11.2; 10.1; 9.9; 8.9; 9.2; 7.5; 10.5
Solution: 


(8.1634, 9.8032) 
Chapter Review


In many cases, the researcher does not know the population standard deviation, σ, of the
measure being studied. In these cases, it is common to use the sample standard deviation, s, as
an estimate of σ. The normal distribution creates accurate confidence intervals when σ is
known, but it is not as accurate when s is used as an estimate. In this case, the Student’s
t-distribution is much better. Define a t-score using the following formula:


$t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$


The t-score follows the Student’s t-distribution with n − 1 degrees of freedom. The confidence
interval under this distribution is calculated with $\bar{x} \pm t_{\alpha/2}\left(\dfrac{s}{\sqrt{n}}\right)$, where $t_{\alpha/2}$ is the t-score with
area to the right equal to α/2, s is the sample standard deviation, and n is the sample size. Use a
table, calculator, or computer to find $t_{\alpha/2}$ for a given α.


Formula Review 


s = the standard deviation of sample values. 


$t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$ is the formula for the t-score, which measures how far away a measure is from the
population mean in the Student’s t-distribution


df = n − 1; the degrees of freedom for a Student’s t-distribution, where n represents the size of
the sample


$T \sim t_{df}$; the random variable, T, has a Student’s t-distribution with df degrees of freedom


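The general form for a confidence interval for a single mean, when the population standard deviation
is unknown and the sample size is less than 30 (Student's t), is given by:


$\bar{x} - t_{\nu,\alpha/2}\left(\dfrac{s}{\sqrt{n}}\right) \le \mu \le \bar{x} + t_{\nu,\alpha/2}\left(\dfrac{s}{\sqrt{n}}\right)$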


Use the following information to answer the next five exercises. A hospital is trying to cut 
down on emergency room wait times. It is interested in the amount of time patients must wait 
before being called back to be examined. An investigation committee randomly surveyed 70 
patients. The sample mean was 1.5 hours with a sample standard deviation of 0.5 hours. 
Exercise: 


Problem: Identify the following: 


Exercise: 


Problem: Define the random variables X and X̄ in words.


Solution: 


X is the number of hours a patient waits in the emergency room before being called back
to be examined. X̄ is the mean wait time of 70 patients in the emergency room.


Exercise: 


Problem: Which distribution should you use for this problem? 
Exercise: 


Problem: 


Construct a 95% confidence interval for the population mean time spent waiting. State the 
confidence interval, sketch the graph, and calculate the error bound. 


Solution: 


CI: (1.3808, 1.6192) 


[Graph: central area 0.95 between 1.3808 and 1.6192.]


EBM = 0.12 


Exercise: 

Problem: Explain in complete sentences what the confidence interval means. 

Use the following information to answer the next six exercises: One hundred eight Americans
were surveyed to determine the number of hours they spend watching television each month. It
was revealed that they watched an average of 151 hours each month with a standard deviation
of 32 hours. Assume that the underlying population distribution is normal.
Exercise: 


Problem: Identify the following: 


Solution: 
a. x̄ = 151
b. $s_x$ = 32
c. n = 108
d. n − 1 = 107


Exercise: 


Problem: Define the random variable X in words. 


Exercise: 


Problem: Define the random variable X̄ in words.


Solution: 


X̄ is the mean number of hours spent watching television per month from a sample of
108 Americans.


Exercise: 


Problem: Which distribution should you use for this problem? 
Exercise: 
Problem: 
Construct a 99% confidence interval for the population mean hours spent watching
television per month. (a) State the confidence interval, (b) sketch the graph, and (c)
calculate the error bound.


Solution: 


CI: (142.92, 159.08) 
[Graph: central area 0.99 between 142.92 and 159.08, centered at 151.]


EBM = 8.08 
Exercise: 


Problem: 


Why would the error bound change if the confidence level were lowered to 95%? 


Use the following information to answer the next 13 exercises: The data in [link] are the result
of a random survey of 39 national flags (with replacement between picks) from various
countries. We are interested in finding a confidence interval for the true mean number of colors
on a national flag. Let X = the number of colors on a national flag.


X    Freq.
1    1
2    7
3    18
4    7
5    6
Exercise: 


Problem: Calculate the following: 


a. x̄ =
b. $s_x$ =
c. n =


Solution: 
a. 3.26 


b. 1.02 
c. 39 


Exercise: 


Problem: Define the random variable X̄ in words.


Exercise: 
Problem: What is x̄ estimating?


Solution: 


μ
Exercise: 


Problem: Is $\sigma_x$ known?


Exercise: 


Problem: 


As a result of your answer to [link], state the exact distribution to use when calculating 
the confidence interval. 


Solution: 


$t_{38}$


Construct a 95% confidence interval for the true mean number of colors on national flags. 
Exercise: 


Problem: How much area is in both tails (combined)? 


Exercise: 


Problem: How much area is in each tail? 


Solution: 
0.025 
Exercise: 
Problem: Calculate the following: 


a. lower limit 
b. upper limit 
c. error bound 


Exercise: 


Problem: The 95% confidence interval is 


Solution: 


(2.93, 3.59) 
Exercise: 


Problem: 


Fill in the blanks on the graph with the areas, the upper and lower limits of the 
Confidence Interval and the sample mean. 


Exercise: 


Problem: In one complete sentence, explain what the interval means. 
Solution: 


We are 95% confident that the true mean number of colors for national flags is between 
2.93 colors and 3.59 colors. 
Exercise: 


Problem: 


Using the same x̄, $s_x$, and level of confidence, suppose that n were 69 instead of 39.
Would the error bound become larger or smaller? How do you know?


Solution: 


The error bound would become EBM = 0.245. This error bound decreases because as 
sample sizes increase, variability decreases and we need less interval length to capture the 
true mean. 


Exercise: 


Problem: 


Using the same x̄, $s_x$, and n = 39, how would the error bound change if the confidence
level were reduced to 90%? Why?


Homework 


Exercise: 


Problem: 


In six packages of “The Flintstones® Real Fruit Snacks” there were five Bam-Bam snack 
pieces. The total number of snack pieces in the six bags was 68. We wish to calculate a 
96% confidence interval for the population proportion of Bam-Bam snack pieces. 


a. Define the random variables X and P’ in words. 


b. Which distribution should you use for this problem? Explain your choice.

c. Calculate p’. 

d. Construct a 96% confidence interval for the population proportion of Bam-Bam 
snack pieces per bag. 


i. State the confidence interval. 
ii. Sketch the graph. 
iii. Calculate the error bound. 


e. Do you think that six packages of fruit snacks yield enough data to give accurate 
results? Why or why not? 


Exercise: 


Problem: 


A random survey of enrollment at 35 community colleges across the United States 
yielded the following figures: 6,414; 1,550; 2,109; 9,350; 21,828; 4,300; 5,944; 5,722; 
2,825; 2,044; 5,481; 5,200; 5,853; 2,750; 10,012; 6,357; 27,000; 9,414; 7,681; 3,200; 
17,500; 9,200; 7,380; 18,314; 6,557; 13,713; 17,768; 7,493; 2,771; 2,861; 1,263; 7,285; 
28,165; 5,080; 11,622. Assume the underlying population is normal. 


a. i. x̄ =
   ii. $s_x$ =
   iii. n =
   iv. n − 1 =


b. Define the random variables X and X̄ in words.

c. Which distribution should you use for this problem? Explain your choice. 

d. Construct a 95% confidence interval for the population mean enrollment at 
community colleges in the United States. 


i. State the confidence interval. 
ii. Sketch the graph. 


e. What will happen to the error bound and confidence interval if 500 community 
colleges were surveyed? Why? 


Solution: 
a. i. 8629
   ii. 6944
   iii. 35
   iv. 34


b. $t_{34}$


c. i. CI: (6244, 11,014)
   ii. [Graph: interval from 6244 to 11,014, centered at 8629.]


d. It will become smaller


Exercise: 


Problem: 


Suppose that a committee is studying whether or not there is waste of time in our judicial 
system. It is interested in the mean amount of time individuals waste at the courthouse 
waiting to be called for jury duty. The committee randomly surveyed 81 people who 
recently served as jurors. The sample mean wait time was eight hours with a sample 
standard deviation of four hours. 


a. i. x̄ =
   ii. $s_x$ =
   iii. n =
   iv. n − 1 =


b. Define the random variables X and X̄ in words.
c. Which distribution should you use for this problem? Explain your choice. 
d. Construct a 95% confidence interval for the population mean time wasted. 


i. State the confidence interval. 
ii. Sketch the graph. 


e. Explain in a complete sentence what the confidence interval means. 


Exercise: 


Problem: 


A pharmaceutical company makes tranquilizers. It is assumed that the distribution for the 
length of time they last is approximately normal. Researchers in a hospital used the drug 
on a random sample of nine patients. The effective period of the tranquilizer for each 
patient (in hours) was as follows: 2.7; 2.8; 3.0; 2.3; 2.3; 2.2; 2.8; 2.1; and 2.4. 


a. i. x̄ =
   ii. $s_x$ =
   iii. n =
   iv. n − 1 =


b. Define the random variable X in words. 


c. Define the random variable X̄ in words.
d. Which distribution should you use for this problem? Explain your choice. 
e. Construct a 95% confidence interval for the population mean length of time. 


i. State the confidence interval. 
ii. Sketch the graph. 


f. What does it mean to be “95% confident” in this problem? 


Solution: 
a. i. x̄ = 2.51
   ii. $s_x$ = 0.318
   iii. n = 9
   iv. n − 1 = 8


b. the effective length of time for a tranquilizer 

c. the mean effective length of time of tranquilizers from a sample of nine patients 

d. We need to use a Student’s-t distribution, because we do not know the population 
standard deviation. 


e. i. CI: (2.27, 2.76)
ii. Check student's solution. 


f. If we were to sample many groups of nine patients and construct a confidence interval from
each sample, about 95% of those intervals would contain the true population mean length of time.
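

When the raw data are available, the sample mean, sample standard deviation, and interval can all be computed directly. The following is a minimal Python sketch and is not part of the original solution: the variable names are illustrative, and the critical value 2.306 (df = 8, upper-tail area 0.025) is assumed to be read from a Student's t table rather than computed.

```python
import math
import statistics

# Effective period of the tranquilizer (hours) for the nine patients
hours = [2.7, 2.8, 3.0, 2.3, 2.3, 2.2, 2.8, 2.1, 2.4]

# Assumed table value: Student's t, df = 8, area 0.025 in the upper tail
T_CRIT = 2.306

n = len(hours)
x_bar = statistics.mean(hours)
s = statistics.stdev(hours)          # sample standard deviation (divides by n - 1)
ebm = T_CRIT * s / math.sqrt(n)      # error bound for the mean
ci = (x_bar - ebm, x_bar + ebm)

print(f"x_bar = {x_bar:.2f}, s = {s:.3f}, df = {n - 1}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Rounded as shown, the results agree with parts a and e of the solution.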


Exercise: 
Problem: 
Suppose that 14 children, who were learning to ride two-wheel bikes, were surveyed to
determine how long they had to use training wheels. It was revealed that they used them
an average of six months with a sample standard deviation of three months. Assume that
the underlying population distribution is normal.


a. i. x̄ =
   ii. $s_x$ =
   iii. n =
   iv. n − 1 =


b. Define the random variable X in words. 


c. Define the random variable X̄ in words.

d. Which distribution should you use for this problem? Explain your choice. 

e. Construct a 99% confidence interval for the population mean length of time using 
training wheels. 


i. State the confidence interval. 
ii. Sketch the graph. 


f. Why would the error bound change if the confidence level were lowered to 90%? 


Exercise: 


Problem: 


The Federal Election Commission (FEC) collects information about campaign 
contributions and disbursements for candidates and political committees each election 
cycle. A political action committee (PAC) is a committee formed to raise money for 
candidates and campaigns. A Leadership PAC is a PAC formed by a federal politician 
(senator or representative) to raise money to help other candidates’ campaigns. 


The FEC has reported financial information for 556 Leadership PACs that were operating
during the 2011–2012 election cycle. The following table shows the total receipts during
this cycle for a random selection of 30 Leadership PACs.


$46,500.00 $0 $40,966.50 $105,887.20 $5,175.00 
$29,050.00 $19,500.00 $181,557.20 $31,500.00 $149,970.80 
$2,555,363.20 $12,025.00 $409,000.00 $60,521.70 $18,000.00 
$61,810.20 $76,530.80 $119,459.20 $0 $63,520.00 
$6,500.00 $502,578.00 $705,061.10 $708,258.90 $135,810.00 
$2,000.00 $2,000.00 $0 $1,287,933.80 $219,148.30 


x̄ = $251,854.23


s = $521,130.41 


Use this sample data to construct a 96% confidence interval for the mean amount of
money raised by all Leadership PACs during the 2011–2012 election cycle. Use the
Student's t-distribution.


Solution: 


x̄ = $251,854.23
s = $521,130.41


Note that we are not given the population standard deviation, only the standard deviation
of the sample.


There are 30 measures in the sample, so n = 30, and df = 30 − 1 = 29.
CL = 0.96, so α = 1 − CL = 1 − 0.96 = 0.04


$\frac{\alpha}{2} = 0.02$, so $t_{\alpha/2} = t_{0.02} = 2.150$


$EBM = t_{\alpha/2}\left(\dfrac{s}{\sqrt{n}}\right) = 2.150\left(\dfrac{521{,}130.41}{\sqrt{30}}\right) \approx \$204{,}561.66$


x̄ − EBM = $251,854.23 − $204,561.66 = $47,292.57


x̄ + EBM = $251,854.23 + $204,561.66 = $456,415.89


We estimate with 96% confidence that the mean amount of money raised by all
Leadership PACs during the 2011–2012 election cycle lies between $47,292.57 and
$456,415.89.


Exercise: 
Problem: 
Forbes magazine published data on the best small firms in 2012. These were firms that
had been publicly traded for at least a year, had a stock price of at least $5 per share, and
had reported annual revenue between $5 million and $1 billion. The table in [link] shows the
ages of the corporate CEOs for a random sample of these firms.


48 58 51 61 56
59 74 63 53 50
59 60 60 57 46
59 63 57 47 55
57 43 61 62 49
67 67 55 55 49


Use this sample data to construct a 90% confidence interval for the mean age of CEOs
for these top small firms. Use the Student's t-distribution.


Exercise: 


Problem: 


Unoccupied seats on flights cause airlines to lose revenue. Suppose a large airline wants 
to estimate its mean number of unoccupied seats per flight over the past year. To 
accomplish this, the records of 225 flights are randomly selected and the number of 
unoccupied seats is noted for each of the sampled flights. The sample mean is 11.6 seats 
and the sample standard deviation is 4.1 seats. 


a. i. x̄ =
   ii. $s_x$ =
   iii. n =
   iv. n − 1 =


b. Define the random variables X and X̄ in words.

c. Which distribution should you use for this problem? Explain your choice. 

d. Construct a 92% confidence interval for the population mean number of unoccupied 
seats per flight. 


i. State the confidence interval. 
ii. Sketch the graph. 


Solution: 
a. i. x̄ = 11.6
   ii. $s_x$ = 4.1
   iii. n = 225
   iv. n − 1 = 224


b. X is the number of unoccupied seats on a single flight. X̄ is the mean number of
unoccupied seats from a sample of 225 flights.


c. We will use a Student’s-t distribution, because we do not know the population 
standard deviation. 


d. i. CI: (11.12, 12.08)
ii. Check student's solution. 


Exercise: 
Problem: 
In a recent sample of 84 used car sales costs, the sample mean was $6,425 with a standard 
deviation of $3,156. Assume the underlying distribution is approximately normal. 
a. Which distribution should you use for this problem? Explain your choice. 
b. Define the random variable X in words. 


c. Construct a 95% confidence interval for the population mean cost of a used car. 


i. State the confidence interval. 
ii. Sketch the graph. 


d. Explain what a “95% confidence interval” means for this study. 


Exercise: 


Problem: 


Six different national brands of chocolate chip cookies were randomly selected at the 
supermarket. The grams of fat per serving are as follows: 8; 8; 10; 7; 9; 9. Assume the 
underlying distribution is approximately normal. 


a. Construct a 90% confidence interval for the population mean grams of fat per 
serving of chocolate chip cookies sold in supermarkets. 


i. State the confidence interval. 
ii. Sketch the graph. 


b. If you wanted a smaller error bound while keeping the same level of confidence, 
what should have been changed in the study before it was done? 

c. Go to the store and record the grams of fat per serving of six brands of chocolate 
chip cookies. 

d. Calculate the mean. 


e. Is the mean within the interval you calculated in part a? Did you expect it to be? 
Why or why not? 


Solution: 


a. i. CI: (7.64, 9.36)
   ii. [Graph: interval from 7.64 to 9.36, centered at 8.5.]


b. The sample size should have been increased.
c. Answers will vary. 
d. Answers will vary. 
e. Answers will vary. 


Exercise: 


Problem: 


A survey of the mean number of cents off that coupons give was conducted by randomly 
surveying one coupon per page from the coupon sections of a recent San Jose Mercury 
News. The following data were collected: 20¢; 75¢; 50¢; 65¢; 30¢; 55¢; 40¢; 40¢; 30¢; 
55¢; $1.50; 40¢; 65¢; 40¢. Assume the underlying distribution is approximately normal. 


a. i. x̄ =
   ii. $s_x$ =
   iii. n =
   iv. n − 1 =


b. Define the random variables X and X̄ in words.
c. Which distribution should you use for this problem? Explain your choice. 
d. Construct a 95% confidence interval for the population mean worth of coupons. 


i. State the confidence interval. 
ii. Sketch the graph. 


e. If many random samples were taken of size 14, what percent of the confidence 
intervals constructed should contain the population mean worth of coupons? Explain 
why. 


Use the following information to answer the next two exercises: A quality control specialist for 
a restaurant chain takes a random sample of size 12 to check the amount of soda served in the 
16 oz. serving size. The sample mean is 13.30 with a sample standard deviation of 1.55. 
Assume the underlying population is normally distributed. 

Exercise: 


Problem: 


Find the 95% Confidence Interval for the true population mean for the amount of soda 
served. 


a. (12.42, 14.18) 
b. (12.32, 14.29) 
c. (12.50, 14.10) 
d. Impossible to determine 


Solution: 


b 


Glossary 


Degrees of Freedom (df) 
the number of objects in a sample that are free to vary 


Normal Distribution 
a continuous random variable (RV) with pdf $f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$, where μ is the
mean of the distribution and σ is the standard deviation; notation: X ~ N(μ, σ). If μ = 0 and
σ = 1, the RV is called the standard normal distribution.


Standard Deviation 
a number that is equal to the square root of the variance and measures how far data values
are from their mean; notation: s for sample standard deviation and σ for population
standard deviation


Student's t-Distribution 
investigated and reported by William S. Gosset in 1908 and published under the
pseudonym Student; the major characteristics of this random variable (RV) are:


• It is continuous and assumes any real values.

• The pdf is symmetrical about its mean of zero.

• It approaches the standard normal distribution as n gets larger.

• There is a "family" of t-distributions: each representative of the family is completely
defined by the number of degrees of freedom, which depends upon the application
for which the t is being used.


A Confidence Interval for A Population Proportion 


During an election year, we see articles in the newspaper that state confidence intervals in terms of 
proportions or percentages. For example, a poll for a particular candidate running for president might 
show that the candidate has 40% of the vote within three percentage points (if the sample is large 
enough). Often, election polls are calculated with 95% confidence, so, the pollsters would be 95% 
confident that the true proportion of voters who favored the candidate would be between 0.37 and 
0.43. 


Investors in the stock market are interested in the true proportion of stocks that go up and down each 
week. Businesses that sell personal computers are interested in the proportion of households in the 
United States that own personal computers. Confidence intervals can be calculated for the true 
proportion of stocks that go up or down each week and for the true proportion of households in the 
United States that own personal computers. 


The procedure to find the confidence interval for a population proportion is similar to that for the
population mean. The formulas differ in detail but are conceptually identical, because they rest on
the same mathematical foundation given to us by the
Central Limit Theorem. Because of this we will see the same basic format using the same three pieces
of information: the sample value of the parameter in question, the standard deviation of the relevant
sampling distribution, and the number of standard deviations we need to have the confidence in our
estimate that we desire.


How do you know you are dealing with a proportion problem? First, the underlying distribution
has a binary random variable and therefore is a binomial distribution. (There is no mention of a
mean or average.) If X is a binomial random variable, then X ~ B(n, p), where n is the number of trials
and p is the probability of a success. To form a sample proportion, take X, the random variable for the
number of successes, and divide it by n, the number of trials (or the sample size). The random variable
P' (read "P prime") is the sample proportion,


$P' = \dfrac{X}{n}$


(Sometimes the random variable is denoted as $\hat{P}$, read "P hat".)


p' = the estimated proportion of successes or sample proportion of successes (p' is a point estimate
for p, the true population proportion; q = 1 − p is the probability of a failure in any one trial.)


x = the number of successes in the sample
n = the size of the sample


The formula for the confidence interval for a population proportion follows the same format as that for
an estimate of a population mean. Remembering the sampling distribution for the proportion from
Chapter 7, the standard deviation was found to be:


Equation:


$\sigma_{p'} = \sqrt{\dfrac{p(1-p)}{n}}$


The confidence interval for a population proportion, therefore, becomes: 


Equation:


$p = p' \pm Z_{\alpha/2}\sqrt{\dfrac{p'(1-p')}{n}}$


$Z_{\alpha/2}$ is set according to our desired degree of confidence and $\sqrt{\dfrac{p'(1-p')}{n}}$ is the standard deviation of
the sampling distribution.


The sample proportions p’ and q’ are estimates of the unknown population proportions p and q. 
The estimated proportions p’ and q' are used because p and q are not known. 


Remember that as p moves further from 0.5, the binomial distribution becomes less symmetrical.
Because we are estimating the binomial with the symmetrical normal distribution, the further away
from symmetrical the binomial becomes, the less confidence we have in the estimate.


This conclusion can be demonstrated through the following analysis. Proportions are based upon the 
binomial probability distribution. The possible outcomes are binary, either “success” or “failure”. This 
gives rise to a proportion, meaning the percentage of the outcomes that are “successes”. It was shown 
that the binomial distribution could be fully understood if we knew only the probability of a success in 
any one trial, called p. The mean and the standard deviation of the binomial were found to be: 
Equation:


$\mu = np$


Equation:


$\sigma = \sqrt{npq}$


It was also shown that the binomial could be estimated by the normal distribution if BOTH np AND
nq were greater than 5. From the discussion above, it was found that the standardizing formula for the
binomial distribution is:

Equation:


$Z = \dfrac{p' - p}{\sqrt{\dfrac{pq}{n}}}$


which is nothing more than a restatement of the general standardizing formula with appropriate
substitutions for μ and σ from the binomial. We can use the standard normal distribution, the reason Z
is in the equation, because the normal distribution is the limiting distribution of the binomial. This is
another example of the Central Limit Theorem. We have already seen that the sampling distribution of
means is normally distributed. Recall the extended discussion in Chapter 7 concerning the sampling
distribution of proportions and the conclusions of the Central Limit Theorem.


We can now manipulate this formula in just the same way we did for finding the confidence intervals 
for a mean, but to find the confidence interval for the binomial population parameter, p. 
Equation:


$p' - Z_{\alpha/2}\sqrt{\dfrac{p'q'}{n}} < p < p' + Z_{\alpha/2}\sqrt{\dfrac{p'q'}{n}}$


Here p' = x/n is the point estimate of p taken from the sample. Notice that p' has replaced p in the
formula. This is because we do not know p; indeed, this is just what we are trying to estimate.


Unfortunately, there is no correction factor for cases where the sample size is small, so np' and nq' must
always be greater than 5 to develop an interval estimate for p.


Example: 
Exercise: 


Problem: 


Suppose that a market research firm is hired to estimate the percent of adults living in a large city 
who have cell phones. Five hundred randomly selected adult residents in this city are surveyed to 
determine whether they have cell phones. Of the 500 people sampled, 421 responded yes - they 
own cell phones. Using a 95% confidence level, compute a confidence interval estimate for the 
true proportion of adult residents of this city who have cell phones. 


Solution: 
• The solution, step by step.


Let X = the number of people in the sample who have cell phones. X is binomial: the random 
variable is binary, people either have a cell phone or they do not. 


To calculate the confidence interval, we must find p' and q'.
n = 500

x = the number of successes in the sample = 421
$p' = \dfrac{x}{n} = \dfrac{421}{500} = 0.842$


p' = 0.842 is the sample proportion; this is the point estimate of the population proportion.


q' = 1 − p' = 1 − 0.842 = 0.158


Since the requested confidence level is CL = 0.95, then α = 1 − CL = 1 − 0.95 = 0.05, so $\frac{\alpha}{2}$ =
0.025.


Then $z_{\alpha/2} = z_{0.025} = 1.96$


This can be found using the Standard Normal probability table in [link]. This can also be found
in the Student's t table at the 0.025 column and infinite degrees of freedom, because at infinite
degrees of freedom the Student's t distribution becomes the standard normal distribution, Z.


The confidence interval for the true binomial population proportion is 


Equation:


$p' - Z_{\alpha/2}\sqrt{\dfrac{p'q'}{n}} < p < p' + Z_{\alpha/2}\sqrt{\dfrac{p'q'}{n}}$


Substituting in the values from above, we find the confidence interval is: 0.810 < p < 0.874


Interpretation 
We estimate with 95% confidence that between 81% and 87.4% of all adult residents of this city 
have cell phones. 


Explanation of 95% Confidence Level 
Ninety-five percent of the confidence intervals constructed in this way would contain the true 
value for the population proportion of all adult residents of this city who have cell phones. 
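

The steps of this example can be collected into one short routine. The following is a minimal Python sketch and is not part of the original text: the function name proportion_interval is an illustrative assumption, and the critical value is computed with the standard library's NormalDist rather than read from a table, so it carries more decimal places than 1.96. The check that np' and nq' exceed 5 follows the rule stated before the example.

```python
import math
from statistics import NormalDist

def proportion_interval(x, n, confidence):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = x / n                 # sample proportion p'
    q_hat = 1 - p_hat             # q' = 1 - p'
    if n * p_hat <= 5 or n * q_hat <= 5:
        raise ValueError("normal approximation requires n*p' and n*q' > 5")
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)      # z_{alpha/2}
    margin = z * math.sqrt(p_hat * q_hat / n)
    return p_hat - margin, p_hat + margin

# Cell phone example: 421 of 500 adults, 95% confidence
low, high = proportion_interval(421, 500, 0.95)
print(f"{low:.3f} < p < {high:.3f}")
```

Rounded to three decimals, the limits agree with the interval found in the solution.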


Note: 
Try It 
Exercise: 


Problem: 


Suppose 250 randomly selected people are surveyed to determine if they own a tablet. Of the 
250 surveyed, 98 reported owning a tablet. Using a 95% confidence level, compute a confidence 
interval estimate for the true proportion of people who own tablets. 


Solution: 


(0.3315, 0.4525) 


Example: 
Exercise: 


Problem: 


The Dundee Dog Training School has a larger than average proportion of clients who compete in 
competitive professional events. A confidence interval for the population proportion of dogs that 
compete in professional events from 150 different training schools is constructed. The lower 
limit is determined to be 0.08 and the upper limit is determined to be 0.16. Determine the level 
of confidence used to construct the interval of the population proportion of dogs that compete in 
professional events. 


Solution: 


We begin with the formula for a confidence interval for a proportion because the random 
variable is binary; either the client competes in professional competitive dog events or they don't. 


Equation:


$p = p' \pm Z_{\alpha/2}\sqrt{\dfrac{p'(1-p')}{n}}$

Next we find the sample proportion:
Equation:
$p' = \dfrac{0.08 + 0.16}{2} = 0.12$


The ± term that makes up the confidence interval is thus 0.04; 0.12 + 0.04 = 0.16 and 0.12 − 0.04 =
0.08, the boundaries of the confidence interval. Finally, we solve for Z.


$Z \cdot \sqrt{\dfrac{0.12(1 - 0.12)}{150}} = 0.04$, therefore Z = 1.51


And then look up the probability for 1.51 standard deviations on the standard normal table.


P(0 < Z < 1.51) = 0.4345, and 0.4345 · 2 = 0.8690, or 86.90%.
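

Working backward from an interval to its confidence level can also be sketched in code. The following minimal Python example is not part of the original text, and the function name implied_confidence is an illustrative assumption. Because it keeps Z unrounded instead of rounding to 1.51 before the table lookup, its result differs slightly from the table-based 86.90%.

```python
import math
from statistics import NormalDist

def implied_confidence(lower, upper, n):
    """Recover Z and the confidence level from the limits of a proportion interval."""
    p_hat = (lower + upper) / 2            # midpoint is the sample proportion
    margin = (upper - lower) / 2           # half-width is the ± term
    std_err = math.sqrt(p_hat * (1 - p_hat) / n)
    z = margin / std_err
    level = 2 * NormalDist().cdf(z) - 1    # central area between -z and +z
    return z, level

z, level = implied_confidence(0.08, 0.16, 150)
print(f"Z = {z:.2f}, confidence level = {level:.3f}")
```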


Example: 
Exercise: 


Problem: 
A financial officer for a company wants to estimate the percent of accounts receivable that are 
more than 30 days overdue. He surveys 500 accounts and finds that 300 are more than 30 days 
overdue. Compute a 90% confidence interval for the true percent of accounts receivable that are 
more than 30 days overdue, and interpret the confidence interval. 
Solution: 

• The solution, step by step:
x = 300 and n = 500
$p' = \dfrac{x}{n} = \dfrac{300}{500} = 0.600$


q' = 1 − p' = 1 − 0.600 = 0.400


Since confidence level = 0.90, then α = 1 − confidence level = (1 − 0.90) = 0.10, so $\frac{\alpha}{2}$ = 0.05
$Z_{\alpha/2} = Z_{0.05} = 1.645$

This Z-value can be found using a standard normal probability table. The Student's t-table can
also be used by entering the table at the 0.05 column and reading at the line for infinite degrees
of freedom. The t-distribution is the normal distribution at infinite degrees of freedom. This is a
handy trick to remember in finding Z-values for commonly used levels of confidence. We use
this formula for a confidence interval for a proportion:

Equation:


$p' - Z_{\alpha/2}\sqrt{\dfrac{p'q'}{n}} < p < p' + Z_{\alpha/2}\sqrt{\dfrac{p'q'}{n}}$


Substituting in the values from above, we find the confidence interval for the true binomial
population proportion is 0.564 < p < 0.636
Interpretation 


• We estimate with 90% confidence that the true percent of all accounts receivable overdue
30 days is between 56.4% and 63.6%.

• Alternate Wording: We estimate with 90% confidence that between 56.4% and 63.6% of
ALL accounts are overdue 30 days.


Explanation of 90% Confidence Level 
Ninety percent of all confidence intervals constructed in this way contain the true value for the 
population percent of accounts receivable that are overdue 30 days. 


Note: 
Try It 
Exercise: 


Problem: 
A student polls her school to see if students in the school district are for or against the new
legislation regarding school uniforms. She surveys 600 students and finds that 480 are against
the new legislation.


a. Compute a 90% confidence interval for the true percent of students who are against the new 
legislation, and interpret the confidence interval. 


Solution: 


(0.7731, 0.8269); We estimate with 90% confidence that the true percent of all students in the 
district who are against the new legislation is between 77.31% and 82.69%. 

Exercise: 
Problem: 


b. In a sample of 300 students, 68% said they own an iPod and a smartphone. Compute a 97%
confidence interval for the true percent of students who own an iPod and a smartphone.


Solution:


Sixty-eight percent (68%) of students own an iPod and a smartphone.

p' = 0.68

q' = 1 − p' = 1 − 0.68 = 0.32

Since CL = 0.97, we know α = 1 − 0.97 = 0.03 and $\frac{\alpha}{2}$ = 0.015.

The area to the right of $z_{0.015}$ is 0.015, and the area to the left of $z_{0.015}$ is 1 − 0.015 = 0.985.
Using the TI-83, 83+, or 84+ calculator function invNorm(.985,0,1),


$z_{0.015} = 2.17$


$EPB = z_{\alpha/2}\sqrt{\dfrac{p'q'}{n}} = 2.17\sqrt{\dfrac{0.68(0.32)}{300}} \approx 0.0584$


p' − EPB = 0.68 − 0.0584 = 0.6216
p' + EPB = 0.68 + 0.0584 = 0.7384


We are 97% confident that the true proportion of all students who own an iPod and a smart 
phone is between 0.6216 and 0.7384. 
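

For confidence levels that are not printed in a table, such as 97%, the critical value can be computed the same way the calculator's invNorm function does it. The following is a minimal Python sketch and is not part of the original solution; the function name z_critical is an illustrative assumption, and NormalDist.inv_cdf from the standard library plays the role of invNorm.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Return z_{alpha/2}: the z-score with area alpha/2 to its right."""
    alpha = 1 - confidence
    # inv_cdf takes the area to the LEFT of the z-score, here 1 - alpha/2
    return NormalDist().inv_cdf(1 - alpha / 2)

for cl in (0.90, 0.95, 0.97, 0.99):
    print(f"CL = {cl:.2f}  ->  z = {z_critical(cl):.3f}")
```

The value for CL = 0.97 corresponds to the 2.17 used above.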
Chapter Review


Some statistical measures, like many survey questions, measure qualitative rather than quantitative 
data. In this case, the population parameter being estimated is a proportion. It is possible to create a 
confidence interval for the true population proportion following procedures similar to those used in 
creating confidence intervals for population means. The formulas are slightly different, but they follow 
the same reasoning. 


Let p' represent the sample proportion, x/n, where x represents the number of successes and n 
represents the sample size. Let q' = 1 — p’. Then the confidence interval for a population proportion is 
given by the following formula: 


$p' - Z_\alpha\sqrt{\frac{p'q'}{n}} \le p \le p' + Z_\alpha\sqrt{\frac{p'q'}{n}}$ 


Formula Review 


$p' = \frac{x}{n}$, where x represents the number of successes in a sample and n represents the sample size. The 
variable p' is the sample proportion and serves as the point estimate for the true population proportion. 

q' = 1 − p' 


The variable p’ has a binomial distribution that can be approximated with the normal distribution 
shown here. The confidence interval for the true population proportion is given by the formula: 


$p' - Z_\alpha\sqrt{\frac{p'q'}{n}} \le p \le p' + Z_\alpha\sqrt{\frac{p'q'}{n}}$ 


$n = \frac{Z_\alpha^2 p'q'}{e^2}$ provides the number of observations needed to sample to estimate the population 
proportion, p, with confidence 1 − α and margin of error e, where e = the acceptable difference 
between the actual population proportion and the sample proportion. 


Use the following information to answer the next two exercises: Marketing companies are interested in 
knowing the population percent of women who make the majority of household purchasing decisions. 
Exercise: 


Problem: 
When designing a study to determine this population proportion, what is the minimum number 


you would need to survey to be 90% confident that the population proportion is estimated to 
within 0.05? 


Exercise: 


Problem: 


If it were later determined that it was important to be more than 90% confident and a new survey 
were commissioned, how would it affect the minimum number you need to survey? Why? 


Solution: 


It would increase, because the z-score would increase, which increases the numerator and 
raises the minimum number you need to survey. 


Use the following information to answer the next five exercises: Suppose the marketing company did 
do a survey. They randomly surveyed 200 households and found that in 120 of them, the woman made 
the majority of the purchasing decisions. We are interested in the population proportion of households 
where women make the majority of the purchasing decisions. 

Exercise: 


Problem: Identify the following: 


Exercise: 


Problem: Define the random variables X and P’ in words. 


Solution: 


X is the number of “successes” where the woman makes the majority of the purchasing decisions 
for the household. P’ is the percentage of households sampled where the woman makes the 
majority of the purchasing decisions for the household. 


Exercise: 


Problem: Which distribution should you use for this problem? 
Exercise: 


Problem: 


Construct a 95% confidence interval for the population proportion of households where the 
women make the majority of the purchasing decisions. State the confidence interval, sketch the 
graph, and calculate the error bound. 


Solution: 


CI: (0.5321, 0.6679) 


[Graph: normal curve centered at p' = 0.6, with the interval endpoints 0.5321 and 0.6679 marked] 


EBP: 0.0679 
Exercise: 
Problem: 


List two difficulties the company might have in obtaining random results, if this survey were 
done by email. 


Use the following information to answer the next five exercises: Of 1,050 randomly selected adults, 
360 identified themselves as manual laborers, 280 identified themselves as non-manual wage earners, 
250 identified themselves as mid-level managers, and 160 identified themselves as executives. In the 
survey, 82% of manual laborers preferred trucks, 62% of non-manual wage earners preferred trucks, 
54% of mid-level managers preferred trucks, and 26% of executives preferred trucks. 

Exercise: 


Problem: 


We are interested in finding the 95% confidence interval for the percent of executives who prefer 
trucks. Define random variables X and P' in words. 


Solution: 


X is the number of “successes” where an executive prefers a truck. P’ is the percentage of 
executives sampled who prefer a truck. 


Exercise: 


Problem: Which distribution should you use for this problem? 
Exercise: 


Problem: 


Construct a 95% confidence interval. State the confidence interval, sketch the graph, and 
calculate the error bound. 


Solution: 


CI: (0.19432, 0.33068) 


[Graph: normal curve with 0.1943, 0.26, and 0.3307 marked on the horizontal axis] 


Exercise: 


Problem: Suppose we want to lower the sampling error. What is one way to accomplish that? 


Exercise: 


Problem: The sampling error given in the survey is ±2%. Explain what the ±2% means. 


Solution: 


The sampling error means that the true mean can be 2% above or below the sample mean. 


Use the following information to answer the next five exercises: A poll of 1,200 voters asked what the 
most significant issue was in the upcoming election. Sixty-five percent answered the economy. We are 
interested in the population proportion of voters who feel the economy is the most important. 
Exercise: 


Problem: Define the random variable X in words. 


Exercise: 


Problem: Define the random variable P’ in words. 


Solution: 


P' is the proportion of voters sampled who said the economy is the most important issue in the 
upcoming election. 


Exercise: 


Problem: Which distribution should you use for this problem? 
Exercise: 


Problem: 

Construct a 90% confidence interval, and state the confidence interval and the error bound. 
Solution: 

CI: (0.62735, 0.67265) 


EBP: 0.02265 


Exercise: 


Problem: What would happen to the confidence interval if the level of confidence were 95%? 


Use the following information to answer the next 16 exercises: The Ice Chalet offers dozens of 
different beginning ice-skating classes. All of the class names are put into a bucket. The 5 P.M., 
Monday night, ages 8 to 12, beginning ice-skating class was picked. In that class were 64 girls and 16 
boys. Suppose that we are interested in the true proportion of girls, ages 8 to 12, in all beginning ice- 
skating classes at the Ice Chalet. Assume that the children in the selected class are a random sample of 
the population. 

Exercise: 


Problem: What is being counted? 


Solution: 
The number of girls, ages 8 to 12, in the 5 P.M. Monday night beginning ice-skating class. 


Exercise: 


Problem: In words, define the random variable X. 
Exercise: 
Problem: Calculate the following: 


a. x = 
b. n = 
c. p'= 
Solution: 
a. X = 64 


b. n = 80 
c. p’ = 0.8 


Exercise: 


Problem: State the estimated distribution of X. X~ 


Exercise: 


Problem: Define a new random variable P’. What is p’ estimating? 
Solution: 


p 


Exercise: 


Problem: In words, define the random variable P’. 
Exercise: 


Problem: 


State the estimated distribution of P’. Construct a 92% Confidence Interval for the true proportion 
of girls in the ages 8 to 12 beginning ice-skating classes at the Ice Chalet. 


Solution: 


$P' \sim N\left(0.8, \sqrt{\frac{(0.8)(0.2)}{80}}\right)$; (0.72171, 0.87829). 


Exercise: 


Problem: How much area is in both tails (combined)? 


Exercise: 


Problem: How much area is in each tail? 


Solution: 
0.04 
Exercise: 
Problem: Calculate the following: 
a. lower limit 


b. upper limit 
c. error bound 


Exercise: 


Problem: The 92% confidence interval is 


Solution: 


(0.72, 0.88) 
Exercise: 


Problem: 


Fill in the blanks on the graph with the areas, upper and lower limits of the confidence interval, 
and the sample proportion. 


Exercise: 


Problem: In one complete sentence, explain what the interval means. 


Solution: 
With 92% confidence, we estimate the proportion of girls, ages 8 to 12, in a beginning ice-skating 
class at the Ice Chalet to be between 72% and 88%. 
Exercise: 
Problem: 
Using the same p’ and level of confidence, suppose that n were increased to 100. Would the error 
bound become larger or smaller? How do you know? 
Exercise: 
Problem: 


Using the same p’ and n = 80, how would the error bound change if the confidence level were 
increased to 98%? Why? 


Solution: 


The error bound would increase. Assuming all other variables are kept constant, as the confidence 
level increases, the area under the curve corresponding to the confidence level becomes larger, 
which creates a wider interval and thus a larger error. 
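
The effect of the confidence level on the error bound can be seen numerically. The sketch below, assuming Python's standard library `statistics.NormalDist` for the critical value, compares the error bound at 92% and 98% confidence for p' = 0.8 and n = 80; `error_bound` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def error_bound(p_hat, n, confidence):
    """Error bound for a proportion using the normal approximation."""
    # Two-sided critical value: area (1 - confidence)/2 in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p_hat * (1 - p_hat) / n)


ebp_92 = error_bound(0.8, 80, 0.92)
ebp_98 = error_bound(0.8, 80, 0.98)
print(f"92% error bound: {ebp_92:.4f}")
print(f"98% error bound: {ebp_98:.4f}")
```

Only the critical value changes between the two calls, so the larger error bound at 98% comes entirely from the larger z.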


Exercise: 


Problem: 


If you decreased the allowable error bound, why would the minimum sample size increase 
(keeping the same level of confidence)? 


Homework 


Exercise: 


Problem: 


Insurance companies are interested in knowing the population percent of drivers who always 
buckle up before riding in a car. 


a. When designing a study to determine this population proportion, what is the minimum 
number you would need to survey to be 95% confident that the population proportion is 
estimated to within 0.03? 

b. If it were later determined that it was important to be more than 95% confident and a new 
survey was commissioned, how would that affect the minimum number you would need to 
survey? Why? 


Solution: 


a. 1,068 
b. The sample size would need to be increased since the critical value increases as the 
confidence level increases. 


Exercise: 


Problem: 


Suppose that the insurance companies did do a survey. They randomly surveyed 400 drivers and 
found that 320 claimed they always buckle up. We are interested in the population proportion of 
drivers who claim they always buckle up. 


a. i. x = 
ii. n = 
iii. p’ = 


b. Define the random variables X and P’, in words. 

c. Which distribution should you use for this problem? Explain your choice. 

d. Construct a 95% confidence interval for the population proportion who claim they always 
buckle up. 


i. State the confidence interval. 
ii. Sketch the graph. 


e. If this survey were done by telephone, list three difficulties the companies might have in 
obtaining random results. 


Exercise: 


Problem: 


According to a recent survey of 1,200 people, 61% feel that the president is doing an acceptable 
job. We are interested in the population proportion of people who feel the president is doing an 
acceptable job. 


a. Define the random variables X and P’ in words. 

b. Which distribution should you use for this problem? Explain your choice. 

c. Construct a 90% confidence interval for the population proportion of people who feel the 
president is doing an acceptable job. 


i. State the confidence interval. 
ii. Sketch the graph. 


Solution: 
a. X = the number of people who feel that the president is doing an acceptable job; 


P' = the proportion of people in a sample who feel that the president is doing an acceptable 
job. 


b. $N\left(0.61, \sqrt{\frac{(0.61)(0.39)}{1200}}\right)$ 


c. i. CI: (0.59, 0.63) 
ii. Check student’s solution 


Exercise: 


Problem: 


An article regarding interracial dating and marriage recently appeared in the Washington Post. Of 
the 1,709 randomly selected adults, 315 identified themselves as Latinos, 323 identified 
themselves as blacks, 254 identified themselves as Asians, and 779 identified themselves as 
whites. In this survey, 86% of blacks said that they would welcome a white person into their 
families. Among Asians, 77% would welcome a white person into their families, 71% would 
welcome a Latino, and 66% would welcome a black person. 


a. We are interested in finding the 95% confidence interval for the percent of all black adults 
who would welcome a white person into their families. Define the random variables X and 
P’, in words. 

b. Which distribution should you use for this problem? Explain your choice. 

c. Construct a 95% confidence interval. 


i. State the confidence interval. 
ii. Sketch the graph. 


Exercise: 


Problem: Refer to the information in [link]. 
a. Construct three 95% confidence intervals. 


i. percent of all Asians who would welcome a white person into their families. 
ii. percent of all Asians who would welcome a Latino into their families. 
iii. percent of all Asians who would welcome a black person into their families. 


b. Even though the three point estimates are different, do any of the confidence intervals 
overlap? Which? 

c. For any intervals that do overlap, in words, what does this imply about the significance of 
the differences in the true proportions? 

d. For any intervals that do not overlap, in words, what does this imply about the significance 
of the differences in the true proportions? 


Solution: 


a. i. (0.72, 0.82) 
ii. (0.65, 0.76) 
iii. (0.60, 0.72) 


b. Yes, the intervals (0.72, 0.82) and (0.65, 0.76) overlap, and the intervals (0.65, 0.76) and 
(0.60, 0.72) overlap. 

c. We can say that there does not appear to be a significant difference between the proportion 
of Asian adults who say that their families would welcome a white person into their families 
and the proportion of Asian adults who say that their families would welcome a Latino 
person into their families. 

d. We can say that there is a significant difference between the proportion of Asian adults who 
say that their families would welcome a white person into their families and the proportion 
of Asian adults who say that their families would welcome a black person into their families. 


Exercise: 


Problem: 


Stanford University conducted a study of whether running is healthy for men and women over 
age 50. During the first eight years of the study, 1.5% of the 451 members of the 50-Plus Fitness 
Association died. We are interested in the proportion of people over 50 who ran and died in the 
same eight-year period. 


a. Define the random variables X and P’ in words. 

b. Which distribution should you use for this problem? Explain your choice. 

c. Construct a 97% confidence interval for the population proportion of people over 50 who 
ran and died in the same eight-year period. 


i. State the confidence interval. 
ii. Sketch the graph. 


d. Explain what a “97% confidence interval” means for this study. 


Exercise: 


Problem: 


A telephone poll of 1,000 adult Americans was reported in an issue of Time Magazine. One of 
the questions asked was “What is the main problem facing the country?” Twenty percent 
answered “crime.” We are interested in the population proportion of adult Americans who feel 
that crime is the main problem. 


a. Define the random variables X and P’ in words. 

b. Which distribution should you use for this problem? Explain your choice. 

c. Construct a 95% confidence interval for the population proportion of adult Americans who 
feel that crime is the main problem. 


i. State the confidence interval. 
ii. Sketch the graph. 


d. Suppose we want to lower the sampling error. What is one way to accomplish that? 
e. The sampling error given by Yankelovich Partners, Inc. (which conducted the poll) is ±3%. 
In one to three complete sentences, explain what the ±3% represents. 


Solution: 


a. X = the number of adult Americans who feel that crime is the main problem; P’ = the 
proportion of adult Americans who feel that crime is the main problem 
b. Since we are estimating a proportion, given P’ = 0.2 and n = 1000, the distribution we should 
use is $N\left(0.2, \sqrt{\frac{(0.2)(0.8)}{1000}}\right)$. 


c. i. CI: (0.18, 0.22) 
ii. Check student’s solution. 


d. One way to lower the sampling error is to increase the sample size. 

e. The stated “±3%” represents the maximum error bound. This means that those doing the 
study are reporting a maximum error of 3%. Thus, they estimate the percentage of adult 
Americans who feel that crime is the main problem to be between 17% and 23%. 


Exercise: 


Problem: 


Refer to [link]. Another question in the poll was “[How much are] you worried about the quality 
of education in our schools?” Sixty-three percent responded “a lot”. We are interested in the 
population proportion of adult Americans who are worried a lot about the quality of education in 
our schools. 


a. Define the random variables X and P’ in words. 

b. Which distribution should you use for this problem? Explain your choice. 

c. Construct a 95% confidence interval for the population proportion of adult Americans who 
are worried a lot about the quality of education in our schools. 


i. State the confidence interval. 
ii. Sketch the graph. 


d. The sampling error given by Yankelovich Partners, Inc. (which conducted the poll) is ±3%. 
In one to three complete sentences, explain what the ±3% represents. 


Use the following information to answer the next three exercises: According to a Field Poll, 79% of 
California adults (actual results are 400 out of 506 surveyed) feel that “education and our schools” is 
one of the top issues facing California. We wish to construct a 90% confidence interval for the true 
proportion of California adults who feel that education and the schools is one of the top issues facing 
California. 

Exercise: 


Problem: A point estimate for the true population proportion is: 


a. 0.90 
b. 1.27 
c. 0.79 
d. 400 


Solution: 


c 


Exercise: 


Problem: A 90% confidence interval for the population proportion is 


a. (0.761, 0.820) 
b. (0.125, 0.188) 
c. (0.755, 0.826) 
d. (0.130, 0.183) 


Use the following information to answer the next two exercises: Five hundred and eleven (511) homes 
in a certain southern California community are randomly surveyed to determine if they meet minimal 
earthquake preparedness recommendations. One hundred seventy-three (173) of the homes surveyed 
met the minimum recommendations for earthquake preparedness, and 338 did not. 

Exercise: 


Problem: 


Find the confidence interval at the 90% Confidence Level for the true population proportion of 
southern California community homes meeting at least the minimum recommendations for 
earthquake preparedness. 


a. (0.2975, 0.3796) 
b. (0.6270, 0.6959) 
c. (0.3041, 0.3730) 
d. (0.6204, 0.7025) 


Exercise: 
Problem: 


The point estimate for the population proportion of homes that do not meet the minimum 
recommendations for earthquake preparedness is 


a. 0.6614 
b. 0.3386 
c. 173 
d. 338 


Solution: 


a 
Exercise: 


Problem: 


On May 23, 2013, Gallup reported that of the 1,005 people surveyed, 76% of U.S. workers 
believe that they will continue working past retirement age. The confidence level for this study 
was reported at 95% with a ±3% margin of error. 


a. Determine the estimated proportion from the sample. 

b. Determine the sample size. 

c. Identify CL and α. 

d. Calculate the error bound based on the information provided. 

e. Compare the error bound in part d to the margin of error reported by Gallup. Explain any 
differences between the values. 

f. Create a confidence interval for the results of this study. 

g. A reporter is covering the release of this study for a local news station. How should she 
explain the confidence interval to her audience? 


Exercise: 


Problem: 


A national survey of 1,000 adults was conducted on May 13, 2013 by Rasmussen Reports. It 
concluded with 95% confidence that 49% to 55% of Americans believe that big-time college 
sports programs corrupt the process of higher education. 


a. Find the point estimate and the error bound for this confidence interval. 

b. Can we (with 95% confidence) conclude that more than half of all American adults believe 
this? 

c. Use the point estimate from part a and n = 1,000 to calculate a 75% confidence interval for 
the proportion of American adults that believe that major college sports programs corrupt 
higher education. 

d. Can we (with 75% confidence) conclude that at least half of all American adults believe 
this? 

Solution: 
a. $p' = \frac{0.55 + 0.49}{2} = 0.52$; EBP = 0.55 − 0.52 = 0.03 


b. No, the confidence interval includes values less than or equal to 0.50. It is possible that less 
than half of the population believe this. 
c. CL = 0.75, so α = 1 − 0.75 = 0.25 and α/2 = 0.125; $z_{\alpha/2} = 1.150$. (The area to the right of this 
z is 0.125, so the area to the left is 1 − 0.125 = 0.875.) 


$EBP = (1.150)\sqrt{\frac{0.52(0.48)}{1000}} \approx 0.018$ 


(p' - EBP, p' + EBP) = (0.52 — 0.018, 0.52 + 0.018) = (0.502, 0.538) 

d. Yes. The entire interval lies above 0.50, so we can conclude that at least half of all 
American adults believe that major sports programs corrupt education, but we do so with 
only 75% confidence. 


Exercise: 


Problem: 


Public Policy Polling recently conducted a survey asking adults across the U.S. about music 
preferences. When asked, 80 of the 571 participants admitted that they have illegally downloaded 
music. 


a. Create a 99% confidence interval for the true proportion of American adults who have 
illegally downloaded music. 

b. This survey was conducted through automated telephone interviews on May 6 and 7, 2013. 
The error bound of the survey compensates for sampling error, or natural variability among 
samples. List some factors that could affect the survey’s outcome that are not covered by the 
margin of error. 

c. Without performing any calculations, describe how the confidence interval would change if 
the confidence level changed from 99% to 90%. 


Exercise: 


Problem: 


You plan to conduct a survey on your college campus to learn about the political awareness of 
students. You want to estimate the true proportion of college students on your campus who voted 
in the 2012 presidential election with 95% confidence and a margin of error no greater than five 
percent. How many students must you interview? 


Glossary 


Binomial Distribution 
a discrete random variable (RV) which arises from Bernoulli trials; there are a fixed number, n, of 
independent trials. “Independent” means that the result of any trial (for example, trial 1) does not 
affect the results of the following trials, and all trials are conducted under the same conditions. 
Under these circumstances the binomial RV X is defined as the number of successes in n trials. 
The notation is: X ~ B(n, p). The mean is μ = np and the standard deviation is $\sigma = \sqrt{npq}$. The 
probability of exactly x successes in n trials is $P(X = x) = \binom{n}{x} p^x q^{n-x}$. 

Error Bound for a Population Proportion (EBP) 


the margin of error; depends on the confidence level, the sample size, and the estimated (from the 
sample) proportion of successes. 


Calculating the Sample Size n: Continuous and Binary Random Variables 


Continuous Random Variables 

Usually we have no control over the sample size of a data set. However, if we are 
able to set the sample size, as in cases where we are taking a survey, it is very 
helpful to know just how large it should be to provide the most information. 
Sampling can be very costly in both time and product. Simple telephone surveys 
will cost approximately $30.00 each, for example, and some sampling requires 
the destruction of the product. 


If we go back to our standardizing formula for the sampling distribution for 
means, we can see that it is possible to solve it for n. If we do this we have 


$(\bar{x} - \mu)$ in the denominator. 


Equation: 

$n = \frac{Z_\alpha^2 \sigma^2}{(\bar{x} - \mu)^2} = \frac{Z_\alpha^2 \sigma^2}{e^2}$ 

Because we have not taken a sample yet we do not know any of the variables in 
the formula except that we can set $Z_\alpha$ to the level of confidence we desire just as 
we did when determining confidence intervals. If we set a predetermined 
acceptable error, or tolerance, for the difference between $\bar{x}$ and μ, called e in the 
formula, we are much further in solving for the sample size n. We still do not 
know the population standard deviation, σ. In practice, a pre-survey is usually 
done which allows for fine tuning the questionnaire and will give a sample 
standard deviation that can be used. In other cases, previous information from 
other surveys may be used for σ in the formula. While crude, this method of 
determining the sample size may help in reducing cost significantly. It will be the 
actual data gathered that determines the inferences about the population, so 
caution in the sample size is appropriate calling for high levels of confidence and 
small sampling errors. 
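
As an illustration of the sample size calculation for a mean, here is a minimal sketch in Python. The values σ = 15 and e = 2 are hypothetical, chosen only to show the arithmetic, and `sample_size_for_mean` is a helper name introduced here. The critical value is taken as the two-sided value for the stated confidence level, using the standard library's `statistics.NormalDist`.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_mean(sigma, e, confidence):
    """Smallest n with z * sigma / sqrt(n) <= e (hypothetical helper)."""
    # Two-sided critical value for the chosen confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # n = z^2 * sigma^2 / e^2, rounded up to a whole observation.
    return ceil((z * sigma / e) ** 2)


# Hypothetical inputs: sigma from a pre-survey, tolerance of 2 units.
n = sample_size_for_mean(sigma=15, e=2, confidence=0.95)
print(n)
```

Halving the tolerance e roughly quadruples the required sample size, because e appears squared in the denominator.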


Binary Random Variables 

What was done in cases when looking for the mean of a distribution can also be 
done when sampling to determine the population parameter p for proportions. 
Manipulation of the standardizing formula for proportions gives: 

Equation: 


$n = \frac{Z_\alpha^2 pq}{e^2}$ 


where e = (p’-p), and is the acceptable sampling error, or tolerance, for this 
application. This will be measured in percentage points. 


In this case the very object of our search is in the formula, p, and of course q 
because q =1-p. This result occurs because the binomial distribution is a one 
parameter distribution. If we know p then we know the mean and the standard 
deviation. Therefore, p shows up in the standard deviation of the sampling 
distribution which is where we got this formula. If, in an abundance of caution, 
we substitute 0.5 for p we will draw the largest required sample size that will 
provide the level of confidence specified by $Z_\alpha$ and the tolerance we have 
selected. This is true because of all combinations of two fractions that add to 
one, the largest product is when each is 0.5. Without any other information 
concerning the population parameter p, this is the common practice. This may 
result in oversampling, but certainly not under sampling, thus, this is a cautious 
approach. 
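
The claim that the product pq is largest at p = 0.5 can be checked directly. The short Python sketch below scans p from 0.01 to 0.99 (the grid of candidate values is an arbitrary choice for illustration) and reports where p(1 − p) is greatest.

```python
# Candidate proportions 0.01, 0.02, ..., 0.99 (illustrative grid).
candidates = [i / 100 for i in range(1, 100)]

# Product p * q with q = 1 - p for each candidate.
products = {p: p * (1 - p) for p in candidates}

best_p = max(products, key=products.get)
print(best_p, products[best_p])
```

Since n is proportional to pq, using p = 0.5 gives the largest sample size the formula can produce for a given confidence level and tolerance.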


There is an interesting trade-off between the level of confidence and the sample 
size that shows up here when considering the cost of sampling. [link] shows the 
appropriate sample size at different levels of confidence and different level of the 
acceptable error, or tolerance. 


Required sample size (90%)    Required sample size (95%)    Tolerance level 

1691                          2401                          2% 
752                           1067                          3% 
271                           384                           5% 
68                            96                            10% 


This table is designed to show the maximum sample size required at different 
levels of confidence given an assumed p= 0.5 and q=0.5 as discussed above. 


The acceptable error, called tolerance in the table, is measured in plus or minus 
values from the actual proportion. For example, an acceptable error of 5% means 
that if the sample proportion was found to be 26 percent, the conclusion would 
be that the actual population proportion is between 21 and 31 percent with a 90 
percent level of confidence if a sample of 271 had been taken. Likewise, if the 
acceptable error was set at 2%, then the population proportion would be between 
24 and 28 percent with a 90 percent level of confidence, but would require that 
the sample size be increased from 271 to 1,691. If we wished a higher level of 
confidence, we would require a larger sample size. Moving from a 90 percent 
level of confidence to a 95 percent level at a plus or minus 5% tolerance requires 
changing the sample size from 271 to 384. A very common sample size often 
seen reported in political surveys is 384. With the survey results it is frequently 
stated that the results are good to a plus or minus 5% level of “accuracy”. 
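
The sample sizes discussed above can be approximated with a few lines of Python. This is a sketch under stated assumptions: p = q = 0.5, a two-sided critical value from the standard library's `statistics.NormalDist`, and a hypothetical helper name `raw_sample_size`. It prints the unrounded value of the formula, so entries can differ by one from a published table depending on how z is rounded and whether n is rounded up or to the nearest integer.

```python
from statistics import NormalDist


def raw_sample_size(e, confidence, p=0.5):
    """Unrounded n = z^2 * p * (1 - p) / e^2 (hypothetical helper)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z ** 2 * p * (1 - p) / e ** 2


print("tolerance   n (90%)   n (95%)")
for e in (0.02, 0.03, 0.05, 0.10):
    n90 = raw_sample_size(e, 0.90)
    n95 = raw_sample_size(e, 0.95)
    print(f"{e:>9.0%}  {n90:8.1f}  {n95:8.1f}")
```

In practice the unrounded value is rounded up to the next whole number so that the stated tolerance is not exceeded.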


Example: 
Exercise: 


Problem: 


Suppose a mobile phone company wants to determine the current 
percentage of customers aged 50+ who use text messaging on their cell 
phones. How many customers aged 50+ should the company survey in 
order to be 90% confident that the estimated (sample) proportion is within 
three percentage points of the true population proportion of customers aged 
50+ who use text messaging on their cell phones. 


Solution: 


From the problem, we know that the acceptable error, e, is 0.03 (3%=0.03) 
and $z_{\alpha/2} = z_{0.05} = 1.645$ because the confidence level is 90%. The acceptable 
error, e, is the difference between the actual population proportion p, and 
the sample proportion we expect to get from the sample. 


However, in order to find n, we need to know the estimated (sample) 
proportion p’. Remember that q' = 1 — p’. But, we do not know p’ yet. Since 
we multiply p’ and q' together, we make them both equal to 0.5 because 


p’q' = (0.5)(0.5) = 0.25 results in the largest possible product. (Try other 
products: (0.6)(0.4) = 0.24; (0.3)(0.7) = 0.21; (0.2)(0.8) = 0.16 and so on). 
The largest possible product gives us the largest n. This gives us a large 
enough sample so that we can be 90% confident that we are within three 
percentage points of the true population proportion. To calculate the sample 
size n, use the formula and make the substitutions. 


$n = \frac{z^2 p'q'}{e^2}$ gives $n = \frac{1.645^2(0.5)(0.5)}{0.03^2} = 751.7$ 

Round the answer to the next higher value. The sample size should be 752 
cell phone customers aged 50+ in order to be 90% confident that the 
estimated (sample) proportion is within three percentage points of the true 
population proportion of all customers aged 50+ who use text messaging 
on their cell phones. 


Note: 
Try It 
Exercise: 


Problem: 

Suppose an internet marketing company wants to determine the current 
percentage of customers who click on ads on their smartphones. How many 
customers should the company survey in order to be 90% confident that the 


estimated proportion is within five percentage points of the true population 
proportion of customers who click on ads on their smartphones? 


Solution: 


271 customers should be surveyed. 


Chapter Review 


Sometimes researchers know in advance that they want to estimate a population 
mean within a specific margin of error for a given level of confidence. In that 


case, solve the relevant confidence interval formula for n to discover the size of 
the sample that is needed to achieve this goal: 


$n = \frac{Z_\alpha^2 \sigma^2}{(\bar{x} - \mu)^2}$ 


If the random variable is binary then the formula for the appropriate sample size 
to maintain a particular level of confidence with a specific tolerance level is 
given by 


$n = \frac{Z_\alpha^2 pq}{e^2}$ 


Formula Review 


$n = \frac{Z_\alpha^2 \sigma^2}{(\bar{x} - \mu)^2}$ = the formula used to determine the sample size (n) needed to 
achieve a desired margin of error at a given level of confidence for a continuous 
random variable 


$n = \frac{Z_\alpha^2 pq}{e^2}$ = the formula used to determine the sample size if the random 
variable is binary 


Use the following information to answer the next five exercises: The standard 
deviation of the weights of elephants is known to be approximately 15 pounds. 
We wish to construct a 95% confidence interval for the mean weight of newborn 
elephant calves. Fifty newborn elephants are weighed. The sample mean is 244 
pounds. The sample standard deviation is 11 pounds. 

Exercise: 


Problem: Identify the following: 


Solution: 


a. 244 


b. 15 
c. 50 


Exercise: 


Problem: In words, define the random variables X and $\bar{X}$. 


Exercise: 


Problem: Which distribution should you use for this problem? 


Solution: 
$N\left(244, \frac{15}{\sqrt{50}}\right)$ 
Exercise: 
Problem: 
Construct a 95% confidence interval for the population mean weight of 


newborn elephants. State the confidence interval, sketch the graph, and 
calculate the error bound. 


Exercise: 
Problem: 


What will happen to the confidence interval obtained, if 500 newborn 
elephants are weighed instead of 50? Why? 


Solution: 


As the sample size increases, there will be less variability in the mean, so 
the interval size decreases. 


Use the following information to answer the next seven exercises: The U.S. 
Census Bureau conducts a study to determine the time needed to complete the 
short form. The Bureau surveys 200 people. The sample mean is 8.2 minutes. 


There is a known standard deviation of 2.2 minutes. The population distribution 
is assumed to be normal. 
Exercise: 


Problem: Identify the following: 


Exercise: 


Problem: In words, define the random variables X and $\bar{X}$. 


Solution: 


X is the time in minutes it takes to complete the U.S. Census short form. $\bar{X}$ 
is the mean time it took a sample of 200 people to complete the U.S. Census 
short form. 


Exercise: 


Problem: Which distribution should you use for this problem? 
Exercise: 


Problem: 

Construct a 90% confidence interval for the population mean time to 
complete the forms. State the confidence interval, sketch the graph, and 
calculate the error bound. 


Solution: 


CI: (7.9441, 8.4559) 


CL = 0.90 


[Graph: normal curve with 7.94, 8.2, and 8.46 marked on the horizontal axis] 


Exercise: 
Problem: 
If the Census wants to increase its level of confidence and keep the error 
bound the same by taking another survey, what changes should it make? 
Exercise: 
Problem: 
If the Census did another survey, kept the error bound the same, and 


surveyed only 50 people instead of 200, what would happen to the level of 
confidence? Why? 


Solution: 


The level of confidence would decrease because decreasing n makes the 
confidence interval wider, so at the same error bound, the confidence level 
decreases. 


Exercise: 


Problem: 


Suppose the Census needed to be 98% confident of the population mean 
length of time. Would the Census have to survey more people? Why or why 
not? 


Use the following information to answer the next ten exercises: A sample of 20 
heads of lettuce was selected. Assume that the population distribution of head 
weight is normal. The weight of each head of lettuce was then recorded. The 
mean weight was 2.2 pounds with a standard deviation of 0.1 pounds. The 
population standard deviation is known to be 0.2 pounds. 


Exercise: 


Problem: Identify the following: 


a. x̄ =
b. σ =
c. n =

Solution:
a. x̄ = 2.2
b. σ = 0.2
c. n = 20

Exercise: 


Problem: In words, define the random variable X. 


Exercise: 


Problem: In words, define the random variable X̄.


Solution: 


X̄ is the mean weight of a sample of 20 heads of lettuce.


Exercise: 


Problem: Which distribution should you use for this problem? 
Exercise: 
Problem: 
Construct a 90% confidence interval for the population mean weight of the
heads of lettuce. State the confidence interval, sketch the graph, and
calculate the error bound.


Solution: 


EBM = 0.07 
CI: (2.1264, 2.2736) 
CL = 0.90 


Exercise: 
Problem: 
Construct a 95% confidence interval for the population mean weight of the
heads of lettuce. State the confidence interval, sketch the graph, and
calculate the error bound.


Exercise: 
Problem: 


In complete sentences, explain why the confidence interval in [link] is 
larger than in [link]. 


Solution: 


The interval is greater because the level of confidence increased. If the only 
change made in the analysis is a change in confidence level, then all we are 
doing is changing how much area is being calculated for the normal 
distribution. Therefore, a larger confidence level results in larger areas and 
larger intervals. 


Exercise: 
Problem: 
In complete sentences, give an interpretation of what the interval in [link] 
means. 


Exercise: 


Problem: 


What would happen if 40 heads of lettuce were sampled instead of 20, and 
the error bound remained the same? 


Solution: 


The confidence level would increase. 
Exercise: 
Problem: 


What would happen if 40 heads of lettuce were sampled instead of 20, and 
the confidence level remained the same? 


Use the following information to answer the next 14 exercises: The mean age for 
all Foothill College students for a recent Fall term was 33.2. The population 
standard deviation has been pretty consistent at 15. Suppose that twenty-five 
Winter students were randomly selected. The mean age for the sample was 30.4. 
We are interested in the true mean age for Winter Foothill College students. Let 
X = the age of a Winter Foothill College student. 

Exercise: 


Problem: x̄ =


Solution: 


30.4 


Exercise: 


Problem: n = 


Exercise: 


Problem: ____ = 15


Solution:


σ


Exercise: 


Problem: In words, define the random variable X̄.


Exercise: 


Problem: What is x̄ estimating?
Solution:


μ
Exercise: 


Problem: Is σₓ known?
Exercise: 


Problem: 


As a result of your answer to [link], state the exact distribution to use when
calculating the confidence interval. 


Solution: 


normal 


Construct a 95% Confidence Interval for the true mean age of Winter Foothill 
College students by working out then answering the next seven exercises. 
Exercise: 


Problem: How much area is in both tails (combined)? α =


Exercise: 


Problem: How much area is in each tail? α/2 =


Solution: 


0.025 
Exercise: 
Problem: Identify the following specifications: 
a. lower limit 


b. upper limit 
c. error bound 


Exercise: 


Problem: The 95% confidence interval is: 


Solution: 


(24.52,36.28) 
Exercise: 


Problem: 


Fill in the blanks on the graph with the areas, upper and lower limits of the 
confidence interval, and the sample mean. 


Exercise: 


Problem: In one complete sentence, explain what the interval means. 
Solution: 


We are 95% confident that the true mean age for Winter Foothill College
students is between 24.52 and 36.28. 


Exercise: 
Problem: 
Using the same mean, standard deviation, and level of confidence, suppose
that n were 69 instead of 25. Would the error bound become larger or
smaller? How do you know?


Exercise: 
Problem: 


Using the same mean, standard deviation, and sample size, how would the 
error bound change if the confidence level were reduced to 90%? Why? 


Solution: 


The error bound for the mean would decrease because as the CL decreases, 
you need less area under the normal curve (which translates into a smaller 
interval) to capture the true population mean. 


Exercise: 
Problem: 
Find the sample size needed to be 90% confident that the sample proportion
and the population proportion are within 4% of each other. The sample
proportion is 0.60. Note: Round all fractions up for n.


Exercise: 
Problem: 
Find the sample size needed to be 95% confident that the sample proportion
and the population proportion are within 2% of each other. The sample
proportion is 0.650. Note: Round all fractions up for n.


Solution: 


2,185 
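
The answer above can be reproduced from the sample-size formula for a proportion, n = z²·p′·q′/EBP², rounded up. The sketch below is an illustration only; the helper name `sample_size_for_proportion` is hypothetical, and it uses the exact normal critical value rather than a table value, so a borderline case could differ by one from a hand calculation.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(p_estimate, error_bound, confidence):
    """Smallest n so the proportion estimate is within error_bound."""
    # Two-sided critical value z_(alpha/2) for the given confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    q_estimate = 1 - p_estimate
    # Always round up: rounding down would give less than the stated confidence.
    return ceil(z**2 * p_estimate * q_estimate / error_bound**2)


# 95% confidence, within 2%, sample proportion 0.65.
n = sample_size_for_proportion(p_estimate=0.65, error_bound=0.02, confidence=0.95)
print(n)
```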


Exercise: 


Problem: 


Find the sample size needed to be 96% confident that the sample proportion
and the population proportion are within 5% of each other. The sample
proportion is 0.70. Note: Round all fractions up for n.


Exercise: 
Problem: 
Find the sample size needed to be 90% confident that the sample proportion
and the population proportion are within 1% of each other. The sample
proportion is 0.50. Note: Round all fractions up for n.


Solution: 


6,765 
Exercise: 
Problem: 
Find the sample size needed to be 94% confident that the sample proportion
and the population proportion are within 2% of each other. The sample
proportion is 0.65. Note: Round all fractions up for n.


Exercise: 
Problem: 
Find the sample size needed to be 95% confident that the sample proportion
and the population proportion are within 4% of each other. The sample
proportion is 0.45. Note: Round all fractions up for n.


Solution: 


595


Exercise: 


Problem: 


Find the sample size needed to be 90% confident that the sample proportion
and the population proportion are within 2% of each other. The sample
proportion is 0.3. Note: Round all fractions up for n.


Homework 


Exercise: 


Problem: 


Among various ethnic groups, the standard deviation of heights is known to 
be approximately three inches. We wish to construct a 95% confidence 
interval for the mean height of male Swedes. Forty-eight male Swedes are 
surveyed. The sample mean is 71 inches. The sample standard deviation is 
2.8 inches. 


a. i. x̄ =


b. In words, define the random variables X and X̄.

c. Which distribution should you use for this problem? Explain your 
choice. 

d. Construct a 95% confidence interval for the population mean height of 
male Swedes. 


i. State the confidence interval. 
ii. Sketch the graph. 


e. What will happen to the level of confidence obtained if 1,000 male 
Swedes are surveyed instead of 48? Why? 


Solution: 


a. i. 71
ii. 2.8
iii. 48

b. X is the height of a male Swede, and X̄ is the mean height from a
sample of 48 male Swedes.

c. Normal. We know the standard deviation for the population, and the 
sample size is greater than 30. 


d. i. CI: (70.151, 71.85)
ii. [Graph: normal curve with confidence limits 70.15 and 71.85]


e. The confidence interval will decrease in size, because the sample size 
increased. Recall, when all factors remain unchanged, an increase in 
sample size decreases variability. Thus, we do not need as large an 
interval to capture the true population mean. 


Exercise: 


Problem: 


Announcements for 84 upcoming engineering conferences were randomly 
picked from a stack of IEEE Spectrum magazines. The mean length of the 
conferences was 3.94 days, with a standard deviation of 1.28 days. Assume 
the underlying population is normal. 


a. In words, define the random variables X and X̄.

b. Which distribution should you use for this problem? Explain your 
choice. 

c. Construct a 95% confidence interval for the population mean length of 
engineering conferences. 


i. State the confidence interval. 
ii. Sketch the graph. 


Exercise: 


Problem: 


Suppose that an accounting firm does a study to determine the time needed 
to complete one person’s tax forms. It randomly surveys 100 people. The 
sample mean is 23.6 hours. There is a known standard deviation of 7.0 
hours. The population distribution is assumed to be normal. 


a. i. x̄ =
ii. σ =
iii. n =

b. In words, define the random variables X and X̄.

c. Which distribution should you use for this problem? Explain your
choice.

d. Construct a 90% confidence interval for the population mean time to
complete the tax forms.

i. State the confidence interval.
ii. Sketch the graph.

e. If the firm wished to increase its level of confidence and keep the error
bound the same by taking another survey, what changes should it
make?

f. If the firm did another survey, kept the error bound the same, and only
surveyed 49 people, what would happen to the level of confidence?
Why?

g. Suppose that the firm decided that it needed to be at least 96%
confident of the population mean length of time to within one hour.
How would the number of people the firm surveys change? Why?


Solution: 
a. i. x̄ = 23.6
ii. σ = 7
iii. n = 100

b. X is the time needed to complete an individual tax form. X̄ is the mean
time to complete tax forms from a sample of 100 customers.

c. N(23.6, 7/√100) because we know sigma.

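d. i. (22.449, 24.751), using z = 1.645 for 90% confidence:
23.6 ± 1.645(7/√100) = 23.6 ± 1.151
ii. [Graph: normal curve with confidence limits 22.449 and 24.751]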


e. It will need to change the sample size. The firm needs to determine 
what the confidence level should be, then apply the error bound 
formula to determine the necessary sample size. 

f. The confidence level would decrease. Smaller sample sizes result in more
variability in the sample mean, so an interval with the same error bound
captures the true population mean with lower confidence.

g. According to the error bound formula, the firm needs to survey 206 
people. Since we increase the confidence level, we need to increase 
either our error bound or the sample size. 


Exercise: 


Problem: 


A sample of 16 small bags of the same brand of candies was selected. 
Assume that the population distribution of bag weights is normal. The 
weight of each bag was then recorded. The mean weight was two ounces 
with a standard deviation of 0.12 ounces. The population standard deviation 
is known to be 0.1 ounce. 


a. i. x̄ =


b. In words, define the random variable X. 


c. In words, define the random variable X̄.

d. Which distribution should you use for this problem? Explain your 
choice. 

e. Construct a 90% confidence interval for the population mean weight of 
the candies. 


i. State the confidence interval. 
ii. Sketch the graph. 


f. Construct a 98% confidence interval for the population mean weight of 
the candies. 


i. State the confidence interval. 
ii. Sketch the graph. 
iii. Calculate the error bound. 


g. In complete sentences, explain why the confidence interval in part f is 
larger than the confidence interval in part e. 

h. In complete sentences, give an interpretation of what the interval in 
part f means. 


Exercise: 


Problem: 


A camp director is interested in the mean number of letters each child sends 
during his or her camp session. The population standard deviation is known 
to be 2.5. A survey of 20 campers is taken. The mean from the sample is 7.9 
with a sample standard deviation of 2.8. 


a. i. x̄ =
ii. σ =
iii. n =


b. Define the random variables X and X̄ in words.
c. Which distribution should you use for this problem? Explain your 
choice. 


d. Construct a 90% confidence interval for the population mean number 
of letters campers send home. 


i. State the confidence interval. 
ii. Sketch the graph. 


e. What will happen to the error bound and confidence interval if 500 
campers are surveyed? Why? 


Solution: 
a. i. 7.9
ii. 2.5
iii. 20

b. X is the number of letters a single camper will send home. X̄ is the
mean number of letters sent home from a sample of 20 campers.

c. N(7.9, 2.5/√20)

d. i. CI: (6.98, 8.82)
ii. [Graph: normal curve with confidence limits 6.98 and 8.82]
e. The error bound and confidence interval will decrease. 
Exercise: 
Problem: 


What is meant by the term “90% confident” when constructing a confidence 
interval for a mean? 


a. If we took repeated samples, approximately 90% of the samples would 
produce the same confidence interval. 

b. If we took repeated samples, approximately 90% of the confidence
intervals calculated from those samples would contain the sample
mean.


c. If we took repeated samples, approximately 90% of the confidence 
intervals calculated from those samples would contain the true value of 
the population mean. 

d. If we took repeated samples, the sample mean would equal the 
population mean in approximately 90% of the samples. 


Exercise: 


Problem: 


The Federal Election Commission collects information about campaign 
contributions and disbursements for candidates and political committees 
each election cycle. During the 2012 campaign season, there were 1,619 
candidates for the House of Representatives across the United States who 
received contributions from individuals. [link] shows the total receipts from 
individuals for a random selection of 40 House candidates rounded to the 
nearest $100. The standard deviation for this data to the nearest hundred is
σ = $909,200.


$3,600      $1,243,900   $10,900    $385,200     $581,500
$7,400      $2,900       $400       $3,714,500   $632,500
$391,000    $467,400     $56,800    $5,800       $405,200
$733,200    $8,000       $468,700   $75,200      $41,000
$13,300     $9,500       $953,800   $1,113,500   $1,109,300
$353,900    $986,100     $88,600    $378,200     $13,200
$3,800      $745,100     $5,800     $3,072,100   $1,626,700
$512,900    $2,309,200   $6,600     $202,400     $15,800


a. Find the point estimate for the population mean. 

b. Using 95% confidence, calculate the error bound. 

c. Create a 95% confidence interval for the mean total individual 
contributions. 

d. Interpret the confidence interval in the context of the problem. 


Solution: 


a. x̄ = $568,873
b. CL = 0.95, α = 1 − 0.95 = 0.05, z_(α/2) = 1.96

EBM = z_(0.025) · σ/√n = 1.96 · 909,200/√40 = $281,764

c. x̄ − EBM = 568,873 − 281,764 = 287,109
x̄ + EBM = 568,873 + 281,764 = 850,637
d. We estimate with 95% confidence that the mean amount of
contributions received from all individuals by House candidates is
between $287,109 and $850,637.
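
The point estimate and error bound in this solution can be recomputed directly from the 40 receipts. The sketch below is one way to do that, assuming σ = $909,200 is treated as known; the variable names are illustrative, and because it uses the exact critical value instead of the rounded 1.96, the error bound comes out a few dollars below the figure in the solution.

```python
from math import sqrt
from statistics import NormalDist, mean

# Total receipts from individuals for the 40 sampled House candidates.
receipts = [
    3_600, 1_243_900, 10_900, 385_200, 581_500,
    7_400, 2_900, 400, 3_714_500, 632_500,
    391_000, 467_400, 56_800, 5_800, 405_200,
    733_200, 8_000, 468_700, 75_200, 41_000,
    13_300, 9_500, 953_800, 1_113_500, 1_109_300,
    353_900, 986_100, 88_600, 378_200, 13_200,
    3_800, 745_100, 5_800, 3_072_100, 1_626_700,
    512_900, 2_309_200, 6_600, 202_400, 15_800,
]
sigma = 909_200  # standard deviation given in the problem

x_bar = mean(receipts)             # point estimate for the population mean
z = NormalDist().inv_cdf(0.975)    # z_(0.025) for 95% confidence
ebm = z * sigma / sqrt(len(receipts))

print(f"x-bar = {x_bar:,.1f}")
print(f"EBM   = {ebm:,.0f}")
print(f"CI    = ({x_bar - ebm:,.0f}, {x_bar + ebm:,.0f})")
```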


Exercise: 


Problem: 


The American Community Survey (ACS), part of the United States Census 
Bureau, conducts a yearly census similar to the one taken every ten years, 
but with a smaller percentage of participants. The most recent survey 
estimates with 90% confidence that the mean household income in the U.S. 
falls between $69,720 and $69,922. Find the point estimate for mean U.S. 
household income and the error bound for mean U.S. household income. 


Exercise: 


Problem: 


The average height of young adult males has a normal distribution with 
standard deviation of 2.5 inches. You want to estimate the mean height of 
students at your college or university to within one inch with 93% 
confidence. How many male students must you measure? 


Exercise: 


Problem: 


If the confidence interval is changed to a higher probability, would this cause
a lower, or a higher, minimum sample size? 


Solution: 


Higher 
Exercise: 


Problem: 


If the tolerance is reduced by half, how would this affect the minimum 
sample size? 


Solution: 


It would increase to four times the prior value. 
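
The factor of four follows from the error bound appearing squared in the denominator of the sample-size formula. A small numeric check, using made-up values (p′ = 0.5, tolerances of 4% and 2%, 95% confidence) and a hypothetical helper name, might look like this:

```python
from statistics import NormalDist


def raw_sample_size(p_estimate, error_bound, confidence):
    """Unrounded n = z^2 * p * q / E^2 (round up before using it)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z**2 * p_estimate * (1 - p_estimate) / error_bound**2


n_original = raw_sample_size(p_estimate=0.5, error_bound=0.04, confidence=0.95)
n_halved = raw_sample_size(p_estimate=0.5, error_bound=0.02, confidence=0.95)

# Halving the tolerance multiplies the required sample size by 2^2 = 4.
print(n_halved / n_original)
```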
Exercise: 
Problem: 


If the value of p is reduced, would this necessarily reduce the sample size 
needed? 


Solution: 


No. It could have no effect if it were to change to 1 − p, for example. If it
gets closer to 0.5 the minimum sample size would increase. 


Exercise: 


Problem: 


Is it acceptable to use a higher sample size than the one calculated by
z²pq/EBP²?


Solution: 


Yes 
Exercise: 


Problem: 


A company has been running an assembly line with 97.42% of the
products made being acceptable. Then, a critical piece broke down. After 
the repairs the decision was made to see if the number of defective products 
made was still close enough to the long standing production quality. 
Samples of 500 pieces were selected at random, and the defective rate was 
found to be 0.025%. 


a. Is this sample size adequate to claim the company is checking within
the 90% confidence interval?
b. The 95% confidence interval?


Solution: 


a. No 
b. No 


Introduction 
You can use a hypothesis test to decide if a dog breeder’s claim that every
Dalmatian has 35 spots is statistically sound. (Credit: Robert Neff)


Now we are down to the bread and butter work of the statistician: 
developing and testing hypotheses. It is important to put this material in a 
broader context so that the method by which a hypothesis is formed is 
understood completely. Using textbook examples often clouds the real 
source of statistical hypotheses. 


Statistical testing is part of a much larger process known as the scientific 
method. This method was developed more than two centuries ago as the 
accepted way that new knowledge could be created. Until then, and
unfortunately even today among some, "knowledge" could be created
simply by some authority saying something was so, ipse dixit. Superstition
and conspiracy theories were (are?) accepted uncritically.


The scientific method, briefly, states that only by following a careful and
specific process can some assertion be included in the accepted body of
knowledge. This process begins with a set of assumptions upon which a
theory, sometimes called a model, is built. This theory, if it has any validity,
will lead to predictions, which we call hypotheses.


As an example, in Microeconomics the theory of consumer choice begins
with certain assumptions concerning human behavior. From these
assumptions, a theory was developed of how consumers make choices using
indifference curves and the budget line. This theory gave rise to a very important
prediction, namely, that there was an inverse relationship between price and 
quantity demanded. This relationship was known as the demand curve. The 
negative slope of the demand curve is really just a prediction, or a 
hypothesis, that can be tested with statistical tools. 


If hundreds and hundreds of statistical tests of this hypothesis had not
confirmed this relationship, the so-called Law of Demand would have been 
discarded years ago. This is the role of statistics, to test the hypotheses of 
various theories to determine if they should be admitted into the accepted 
body of knowledge; how we understand our world. Once admitted, 
however, they may be later discarded if new theories come along that make 
better predictions. 


Not long ago two scientists claimed that they could get more energy out of a 
process than was put in. This caused a tremendous stir for obvious reasons. 
They were on the cover of Time and were offered extravagant sums to bring 
their research work to private industry and any number of universities. It 
was not long until their work was subjected to the rigorous tests of the 
scientific method and found to be a failure. No other lab could replicate 
their findings. Consequently, they have sunk into obscurity and their theory
has been discarded. It may surface again when someone can pass the tests of the
hypotheses required by the scientific method, but until then it is just a 
curiosity. Many pure frauds have been attempted over time, but most have 
been found out by applying the process of the scientific method. 


This discussion is meant to show just where in this process statistics falls. 
Statistics and statisticians are not necessarily in the business of developing 
theories, but in the business of testing others’ theories. Hypotheses come 
from these theories based upon an explicit set of assumptions and sound 
logic. The hypothesis comes first, before any data are gathered. Data do not 
create hypotheses; they are used to test them. If we bear this in mind as we
study this section, the process of forming and testing hypotheses will make
more sense.


One job of a statistician is to make statistical inferences about populations
based on samples taken from the population. Confidence intervals are one 
way to estimate a population parameter. Another way to make a statistical 
inference is to make a decision about the value of a specific parameter. For 
instance, a car dealer advertises that its new small truck gets 35 miles per 
gallon, on average. A tutoring service claims that its method of tutoring 
helps 90% of its students get an A or a B. A company says that women 
managers in their company earn an average of $60,000 per year. 


A statistician will make a decision about these claims. This process is called
"hypothesis testing." A hypothesis test involves collecting data from a
sample and evaluating the data. Then, the statistician makes a decision as to 
whether or not there is sufficient evidence, based upon analyses of the data, 
to reject the null hypothesis. 


In this chapter, you will conduct hypothesis tests on single means and single 
proportions. You will also learn about the errors associated with these tests. 


Glossary 


Confidence Interval (CI) 
an interval estimate for an unknown population parameter. This 
depends on: 


• The desired confidence level.
• Information that is known about the distribution (for example,
known standard deviation).
• The sample and its size.


Hypothesis Testing 
Based on sample evidence, a procedure for determining whether the 
hypothesis stated is a reasonable statement and should not be rejected, 
or is unreasonable and should be rejected. 


Null and Alternative Hypotheses 


The actual test begins by considering two hypotheses. They are called the 
null hypothesis and the alternative hypothesis. These hypotheses contain 
opposing viewpoints. 


H₀: The null hypothesis: It is a statement of no difference between a
sample mean or proportion and a population mean or proportion. In other
words, the difference equals 0. This can often be considered the status quo;
as a result, if you cannot accept the null, some action is required.


Hₐ: The alternative hypothesis: It is a claim about the population that is
contradictory to H₀ and what we conclude when we cannot accept H₀. The
alternative hypothesis is the contender and must win with significant
evidence to overthrow the status quo. This concept is sometimes referred to
as the tyranny of the status quo because, as we will see later, overthrowing
the null hypothesis usually takes 90 percent or greater confidence that this
is the proper decision.


Since the null and alternative hypotheses are contradictory, you must 
examine evidence to decide if you have enough evidence to reject the null 
hypothesis or not. The evidence is in the form of sample data. 


After you have determined which hypothesis the sample supports, you
make a decision. There are two options for a decision. They are "cannot
accept H₀" if the sample information favors the alternative hypothesis, or
"do not reject H₀" or "decline to reject H₀" if the sample information is
insufficient to reject the null hypothesis. These conclusions are all based
upon a level of probability, a significance level, that is set by the analyst.


Table 9.1 presents the various hypotheses in the relevant pairs. For example, 
if the null hypothesis is equal to some value, the alternative has to be not 
equal to that value. 


H₀                              Hₐ
equal (=)                       not equal (≠)
greater than or equal to (≥)    less than (<)
less than or equal to (≤)       more than (>)
Note:

As a mathematical convention, H₀ always has a symbol with an equal in it.
Hₐ never has a symbol with an equal in it. The choice of symbol depends
on the wording of the hypothesis test.
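
The pairing rule in Table 9.1 is mechanical enough to write down as a lookup. The following sketch is only an illustration of that rule; the names `ALTERNATIVE_FOR` and `state_hypotheses` are invented here and are not part of any statistics library.

```python
# Each symbol allowed in the null hypothesis and its matching alternative.
ALTERNATIVE_FOR = {"=": "≠", "≥": "<", "≤": ">"}


def state_hypotheses(parameter, null_symbol, value):
    """Return the (H0, Ha) statements for a parameter such as 'μ' or 'p'."""
    if null_symbol not in ALTERNATIVE_FOR:
        # The null hypothesis must always contain some form of equality.
        raise ValueError("null hypothesis needs =, ≥ or ≤")
    alt_symbol = ALTERNATIVE_FOR[null_symbol]
    return (
        f"H0: {parameter} {null_symbol} {value}",
        f"Ha: {parameter} {alt_symbol} {value}",
    )


h0, ha = state_hypotheses("p", "≤", "0.30")
print(h0)
print(ha)
```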


Example: 

H₀: No more than 30% of the registered voters in Santa Clara County voted
in the primary election. p ≤ 0.30

Hₐ: More than 30% of the registered voters in Santa Clara County voted in
the primary election. p > 0.30


Example: 

We want to test whether the mean GPA of students in American colleges is 
different from 2.0 (out of 4.0). The null and alternative hypotheses are: 

H₀: μ = 2.0

Hₐ: μ ≠ 2.0


Example: 
We want to test if college students take less than five years to graduate 
from college, on the average. The null and alternative hypotheses are: 


H₀: μ ≥ 5
Hₐ: μ < 5


Chapter Review 


In a hypothesis test, sample data is evaluated in order to arrive at a decision 
about some type of claim. If certain conditions about the sample are 
satisfied, then the claim can be evaluated for a population. In a hypothesis 
test, we: 


1. Evaluate the null hypothesis, typically denoted with H₀. The null is
not rejected unless the hypothesis test shows otherwise. The null
statement must always contain some form of equality (=, ≤, or ≥).

2. Always write the alternative hypothesis, typically denoted with Hₐ or
H₁, using not equal, less than, or greater than symbols, i.e., (≠, <, or >).

3. If we reject the null hypothesis, then we can assume there is enough 
evidence to support the alternative hypothesis. 

4. Never state that a claim is proven true or false. Keep in mind the 
underlying fact that hypothesis testing is based on probability laws; 
therefore, we can talk only in terms of non-absolute certainties. 


Exercise: 
Problem: 
You are testing that the mean speed of your cable Internet connection
is more than three Megabits per second. What is the random variable?
Describe in words.


Solution: 


The random variable is the mean Internet speed in Megabits per 
second. 


Exercise: 


Problem: 


You are testing that the mean speed of your cable Internet connection 
is more than three Megabits per second. State the null and alternative 
hypotheses. 


Exercise: 
Problem: 


The American family has an average of two children. What is the 
random variable? Describe in words. 


Solution: 
The random variable is the mean number of children an American 
family has. 
Exercise: 
Problem: 
The mean entry level salary of an employee at a company is $58,000.
You believe it is higher for IT professionals in the company. State the
null and alternative hypotheses.


Exercise: 
Problem: 
A sociologist claims the probability that a person picked at random in
Times Square in New York City is visiting the area is 0.83. You want
to test to see if the proportion is actually less. What is the random
variable? Describe in words.


Solution: 


The random variable is the proportion of people picked at random in 
Times Square visiting the city. 


Exercise: 


Problem: 


A sociologist claims the probability that a person picked at random in 
Times Square in New York City is visiting the area is 0.83. You want 
to test to see if the claim is correct. State the null and alternative 
hypotheses. 


Exercise: 
Problem: 
In a population of fish, approximately 42% are female. A test is
conducted to see if, in fact, the proportion is less. State the null and
alternative hypotheses.


Solution: 
a. H₀: p = 0.42
b. Hₐ: p < 0.42
Exercise: 
Problem: 


Suppose that a recent article stated that the mean time spent in jail by a 
first-time convicted burglar is 2.5 years. A study was then done to see 
if the mean time has increased in the new century. A random sample of 
26 first-time convicted burglars in a recent year was picked. The mean 
length of time in jail from the survey was 3 years with a standard 
deviation of 1.8 years. Suppose that it is somehow known that the 
population standard deviation is 1.5. If you were conducting a 
hypothesis test to determine if the mean length of jail time has 
increased, what would the null and alternative hypotheses be? The 
distribution of the population is normal. 


a. H₀: ____
b. Hₐ: ____


Exercise: 


Problem: 


A random survey of 75 death row inmates revealed that the mean 
length of time on death row is 17.4 years with a standard deviation of 
6.3 years. If you were conducting a hypothesis test to determine if the 
population mean time on death row could likely be 15 years, what 
would the null and alternative hypotheses be? 


a. H₀: ____
b. Hₐ: ____


Solution: 


a. H₀: μ = 15
b. Hₐ: μ ≠ 15


Exercise: 


Problem: 


The National Institute of Mental Health published an article stating 
that in any one-year period, approximately 9.5 percent of American 
adults suffer from depression or a depressive illness. Suppose that in a 
survey of 100 people in a certain town, seven of them suffered from 
depression or a depressive illness. If you were conducting a hypothesis 
test to determine if the true proportion of people in that town suffering 
from depression or a depressive illness is lower than the percent in the 
general adult American population, what would the null and 
alternative hypotheses be? 


a. H₀: ____
b. Hₐ: ____


Homework 


Exercise: 


Problem: 


Some of the following statements refer to the null hypothesis, some to 
the alternate hypothesis. 


State the null hypothesis, H₀, and the alternative hypothesis, Hₐ, in
terms of the appropriate parameter (μ or p).

a. The mean number of years Americans work before retiring is 34.
b. At most 60% of Americans vote in presidential elections.
c. The mean starting salary for San Jose State University graduates
is at least $100,000 per year.
d. Twenty-nine percent of high school seniors get drunk each month.
e. Fewer than 5% of adults ride the bus to work in Los Angeles.
f. The mean number of cars a person owns in her lifetime is not
more than ten.
g. About half of Americans prefer to live away from cities, given the
choice.
h. Europeans have a mean paid vacation each year of six weeks.
i. The chance of developing breast cancer is under 11% for women.
j. Private universities' mean tuition cost is more than $20,000 per
year.


Solution: 


a. H₀: μ = 34; Hₐ: μ ≠ 34
b. H₀: p ≤ 0.60; Hₐ: p > 0.60
c. H₀: μ ≥ 100,000; Hₐ: μ < 100,000
d. H₀: p = 0.29; Hₐ: p ≠ 0.29
e. H₀: p = 0.05; Hₐ: p < 0.05
f. H₀: μ ≤ 10; Hₐ: μ > 10
g. H₀: p = 0.50; Hₐ: p ≠ 0.50
h. H₀: μ = 6; Hₐ: μ ≠ 6
i. H₀: p = 0.11; Hₐ: p < 0.11
j. H₀: μ ≤ 20,000; Hₐ: μ > 20,000


Exercise: 


Problem: 


Over the past few decades, public health officials have examined the 
link between weight concerns and teen girls' smoking. Researchers 
surveyed a group of 273 randomly selected teen girls living in 
Massachusetts (between 12 and 15 years old). After four years the girls 
were surveyed again. Sixty-three said they smoked to stay thin. Is there 
good evidence that more than thirty percent of the teen girls smoke to 
stay thin? The alternative hypothesis is: 


a. p < 0.30
b. p ≤ 0.30
c. p ≥ 0.30
d. p > 0.30


Exercise: 


Problem: 


A statistics instructor believes that fewer than 20% of Evergreen 
Valley College (EVC) students attended the opening night midnight 
showing of the latest Harry Potter movie. She surveys 84 of her 
students and finds that 11 attended the midnight showing. An 
appropriate alternative hypothesis is: 


a. p = 0.20
b. p > 0.20
c. p < 0.20
d. p ≤ 0.20


Solution: 


c


Exercise: 


Problem: 


Previously, an organization reported that teenagers spent 4.5 hours per 
week, on average, on the phone. The organization thinks that, 
currently, the mean is higher. Fifteen randomly chosen teenagers were 
asked how many hours per week they spend on the phone. The sample 
mean was 4.75 hours with a sample standard deviation of 2.0. Conduct 
a hypothesis test. The null and alternative hypotheses are: 


a. H0: x̄ = 4.5, Ha: x̄ > 4.5
b. H0: μ ≥ 4.5, Ha: μ < 4.5
c. H0: μ = 4.75, Ha: μ > 4.75
d. H0: μ = 4.5, Ha: μ > 4.5
Glossary


Hypothesis 
a statement about the value of a population parameter. When there are two
hypotheses, the statement assumed to be true is called the null
hypothesis (notation H0) and the contradictory statement is called the
alternative hypothesis (notation Ha).


Outcomes and the Type I and Type II Errors 


When you perform a hypothesis test, there are four possible outcomes
depending on the actual truth (or falseness) of the null hypothesis H0 and
the decision to reject or not. The outcomes are summarized in the following
table:


Statistical Decision    H0 is actually True    H0 is actually False
Cannot reject H0        Correct outcome        Type II error
Cannot accept H0        Type I error           Correct outcome


The four possible outcomes in the table are: 


1. The decision is cannot reject H0 when H0 is true (correct decision).
2. The decision is cannot accept H0 when H0 is true (incorrect decision
known as a Type I error). This case is described as "rejecting a good
null". As we will see later, it is this type of error that we will guard
against by setting the probability of making such an error. The goal is
to NOT take an action that is an error.
3. The decision is cannot reject H0 when, in fact, H0 is false (incorrect
decision known as a Type II error). This is called "accepting a false
null". In this situation you have allowed the status quo to remain in
force when it should be overturned. As we will see, the null hypothesis
has the advantage in competition with the alternative.
4. The decision is cannot accept H0 when H0 is false (correct decision).


Each of the errors occurs with a particular probability. The Greek letters α
and β represent the probabilities.


α = probability of a Type I error = P(Type I error) = probability of
rejecting the null hypothesis when the null hypothesis is true: rejecting a
good null.


β = probability of a Type II error = P(Type II error) = probability of not
rejecting the null hypothesis when the null hypothesis is false. (1 − β) is
called the Power of the Test.


α and β should be as small as possible because they are probabilities of
errors.


Statistics allows us to set the probability that we are making a Type I error.
The probability of making a Type I error is α. Recall that the confidence
intervals in the last unit were set by choosing a value called Zα (or tα) and
the alpha value determined the confidence level of the estimate because it
was the probability of the interval failing to capture the true mean (or
proportion parameter p). This alpha and that one are the same.
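The link between the confidence level and alpha can be checked numerically. The following is a minimal Python sketch, assuming a two-tailed setting and the standard normal distribution; the function names are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def alpha_from_confidence(confidence_level):
    """Return alpha, the probability of a Type I error, for a confidence level."""
    return 1.0 - confidence_level


def two_tailed_critical_z(alpha):
    """Return the positive Z value that leaves alpha/2 in each tail."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


# Illustrative confidence levels only; any level between 0 and 1 works.
for level in (0.90, 0.95, 0.99):
    alpha = alpha_from_confidence(level)
    print(f"confidence {level:.2f} -> alpha {alpha:.2f}, "
          f"critical Z {two_tailed_critical_z(alpha):.3f}")
```

A 95% confidence level corresponds to α = 0.05 and a critical Z of about 1.96, the same value used for a 95% confidence interval.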


The easiest way to see the relationship between the alpha error and the level 
of confidence is with the following figure. 


H0: μ = 100
Ha: μ ≠ 100


In the center of [link] is a normally distributed sampling distribution
marked H0. This is a sampling distribution of X̄ and by the Central Limit
Theorem it is normally distributed. It represents the distribution for the
null hypothesis H0: μ = 100. This is the value that is being tested. The
formal statements of the null and alternative hypotheses are listed below
the figure.


The distributions on either side of the H0 distribution represent distributions
that would be true if H0 is false, under the alternative hypothesis listed as
Ha. We do not know which is true, and will never know. There are, in fact,
an infinite number of distributions from which the data could have been
drawn if Ha is true, but only two of them are shown in [link], representing
all of the others.


To test a hypothesis we take a sample from the population and determine if
it could have come from the hypothesized distribution with an acceptable
level of significance. This level of significance is the alpha error and is
marked on [link] as the shaded areas in each tail of the H0 distribution.
(Each area is actually α/2 because the distribution is symmetrical and the
alternative hypothesis allows for the possibility for the value to be either
greater than or less than the hypothesized value; this is called a
two-tailed test.)


If the sample mean marked as X̄1 is in the tail of the distribution of H0, we
conclude that the probability that it could have come from the H0
distribution is less than alpha. We consequently state, "the null hypothesis
cannot be accepted with (α) level of significance". The truth may be that
this X̄1 did come from the H0 distribution, but from out in the tail. If this is
so then we have falsely rejected a true null hypothesis and have made a
Type I error. What statistics has done is provide an estimate about what we
know, and what we control, and that is the probability of us being wrong, α.
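The idea that a sample can come from the H0 distribution and still land in a tail can be illustrated with a small simulation. This is only a sketch under stated assumptions: the hypothesized mean of 100 comes from the figure, while the population standard deviation of 15, the sample size of 36, the number of trials, and the function name are all hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist, fmean


def type_one_error_rate(mu0=100.0, sigma=15.0, n=36, alpha=0.05,
                        trials=4000, seed=1):
    """Estimate how often a true H0: mu = mu0 is rejected in a two-tailed Z test."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    standard_error = sigma / n ** 0.5
    rejections = 0
    for _ in range(trials):
        # Every sample here really does come from the H0 distribution.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z_c = (fmean(sample) - mu0) / standard_error
        if abs(z_c) > z_crit:
            rejections += 1  # a Type I error: a good null was rejected
    return rejections / trials


print(type_one_error_rate())
```

Because the null hypothesis is true in every trial, each rejection is a Type I error, and the printed proportion should be close to the chosen α of 0.05.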


We can also see in [link] that the sample mean could really be from an Ha
distribution, but within the boundary set by the alpha level. Such a case is
marked as X̄2. There is a probability that X̄2 actually came from Ha but
shows up in the range of H0 between the two tails. This probability is the
beta error, the probability of accepting a false null.
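The beta error can only be computed once one specific alternative distribution is chosen. The sketch below does this for a two-tailed Z test; the hypothesized mean of 100 is taken from the figure, while the standard deviation of 15, the sample size of 36, the candidate true means, and the function name are assumptions made purely for illustration.

```python
from statistics import NormalDist


def beta_and_power(mu0, mu_true, sigma, n, alpha=0.05):
    """Return (beta, power) of a two-tailed Z test when the true mean is mu_true."""
    standard_error = sigma / n ** 0.5
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    lower = mu0 - z_crit * standard_error
    upper = mu0 + z_crit * standard_error
    # Sampling distribution of the sample mean under this specific alternative.
    alternative = NormalDist(mu_true, standard_error)
    # Beta is the chance the sample mean falls between the two critical values.
    beta = alternative.cdf(upper) - alternative.cdf(lower)
    return beta, 1.0 - beta


for mu_true in (102.0, 105.0, 110.0):
    beta, power = beta_and_power(100.0, mu_true, sigma=15.0, n=36)
    print(f"true mean {mu_true}: beta = {beta:.3f}, power = {power:.3f}")
```

The further the true mean lies from the hypothesized value, the smaller β becomes and the larger the power, which is why β cannot be fixed in advance the way α can.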


Our problem is that we can only set the alpha error because there are an
infinite number of alternative distributions from which the mean could have
come that are not equal to H0. As a result, the statistician places the burden
of proof on the alternative hypothesis. That is, we will not reject a null
hypothesis unless the evidence against it is strong enough to meet a 90, or
95, or even 99 percent level of confidence. This is why we called this the
tyranny of the status quo earlier.


By way of example, the American judicial system begins with the concept
that a defendant is "presumed innocent". This is the status quo and is the
null hypothesis. The judge will tell the jury that they cannot find the
defendant guilty unless the evidence indicates guilt beyond a "reasonable
doubt" which is usually defined in criminal cases as 95% certainty of guilt.
If the jury cannot accept the null hypothesis of innocence, then action will
be taken: jail time. The burden of proof always lies with the alternative
hypothesis. (In civil cases, the jury needs only to be more than 50% certain
of wrongdoing to find culpability, called "a preponderance of the evidence".)


The example above was for a test of a mean, but the same logic applies to 
tests of hypotheses for all statistical parameters one may wish to test. 


The following are examples of Type I and Type II errors. 


Example: 

Suppose the null hypothesis, H0, is: Frank's rock climbing equipment is
safe.

Type I error: Frank thinks that his rock climbing equipment may not be
safe when, in fact, it really is safe.

Type II error: Frank thinks that his rock climbing equipment may be safe
when, in fact, it is not safe.

α = probability that Frank thinks his rock climbing equipment may not be
safe when, in fact, it really is safe. β = probability that Frank thinks his
rock climbing equipment may be safe when, in fact, it is not safe.

Notice that, in this case, the error with the greater consequence is the Type
II error. (If Frank thinks his rock climbing equipment is safe, he will go
ahead and use it.)

This is a situation described as "accepting a false null".


Example: 

Suppose the null hypothesis, H0, is: The victim of an automobile accident
is alive when he arrives at the emergency room of a hospital. This is the
status quo and requires no action if it is true. If the null hypothesis cannot
be accepted then action is required and the hospital will begin appropriate
procedures.

Type I error: The emergency crew thinks that the victim is dead when, in
fact, the victim is alive. Type II error: The emergency crew does not
know if the victim is alive when, in fact, the victim is dead.

α = probability that the emergency crew thinks the victim is dead when, in
fact, he is really alive = P(Type I error). β = probability that the
emergency crew does not know if the victim is alive when, in fact, the
victim is dead = P(Type II error).

The error with the greater consequence is the Type I error. (If the
emergency crew thinks the victim is dead, they will not treat him.)


Note: 
Try It 
Exercise: 


Problem: 


Suppose the null hypothesis, H0, is: a patient is not sick. Which type
of error has the greater consequence, Type I or Type II?


Solution: 


The error with the greater consequence is the Type II error: the patient 
will be thought well when, in fact, he is sick, so he will not get 
treatment. 


Example: 

It’s a Boy Genetic Labs claim to be able to increase the likelihood that a
pregnancy will result in a boy being born. Statisticians want to test the
claim. Suppose that the null hypothesis, H0, is: It’s a Boy Genetic Labs has
no effect on gender outcome. The status quo is that the claim is false. The
burden of proof always falls to the person making the claim, in this case
the Genetics Lab.


Type I error: This results when a true null hypothesis is rejected. In the
context of this scenario, we would state that we believe that It’s a Boy
Genetic Labs influences the gender outcome, when in fact it has no effect.
The probability of this error occurring is denoted by the Greek letter alpha,
α.

Type II error: This results when we fail to reject a false null hypothesis. In
context, we would state that It’s a Boy Genetic Labs does not influence the
gender outcome of a pregnancy when, in fact, it does. The probability of
this error occurring is denoted by the Greek letter beta, β.

The error of greater consequence would be the Type I error since couples 
would use the It’s a Boy Genetic Labs product in hopes of increasing the 
chances of having a boy. 


Note: 
Try It 
Exercise: 


Problem: 


“Red tide” is a bloom of poison-producing algae—a few different 
species of a class of plankton called dinoflagellates. When the 
weather and water conditions cause these blooms, shellfish such as 
clams living in the area develop dangerous levels of a paralysis- 
inducing toxin. In Massachusetts, the Division of Marine Fisheries 
(DMF) monitors levels of the toxin in shellfish by regular sampling of 
shellfish along the coastline. If the mean level of toxin in clams 
exceeds 800 μg (micrograms) of toxin per kg of clam meat in any
area, clam harvesting is banned there until the bloom is over and 
levels of toxin in clams subside. Describe both a Type I and a Type II 
error in this context, and state which error has the greater 
consequence. 


Solution: 


In this scenario, an appropriate null hypothesis would be H0: the mean
level of toxins is at most 800 μg, H0: μ0 ≤ 800 μg.


Type I error: The DMF believes that toxin levels are still too high
when, in fact, toxin levels are at most 800 μg. The DMF continues the
harvesting ban.


Type II error: The DMF believes that toxin levels are within
acceptable levels (are at most 800 μg) when, in fact, toxin levels are
still too high (more than 800 μg). The DMF lifts the harvesting ban.
This error could be the most serious. If the ban is lifted and clams are 
still toxic, consumers could possibly eat tainted food. 


In summary, the more dangerous error would be to commit a Type II 
error, because this error involves the availability of tainted clams for 
consumption. 


Example: 

A certain experimental drug claims a cure rate of at least 75% for males 
with prostate cancer. Describe both the Type I and Type II errors in 
context. Which error is the more serious? 

Type I: A cancer patient believes the cure rate for the drug is less than 75% 
when it actually is at least 75%. 

Type II: A cancer patient believes the experimental drug has at least a 75% 
cure rate when it has a cure rate that is less than 75%. 

In this scenario, the Type II error contains the more severe consequence. If 
a patient believes the drug works at least 75% of the time, this most likely 
will influence the patient’s (and doctor’s) choice about whether to use the 
drug as a treatment option. 


Chapter Review 


In every hypothesis test, the outcomes are dependent on a correct
interpretation of the data. Incorrect calculations or misunderstood summary
statistics can yield errors that affect the results. A Type I error occurs when
a true null hypothesis is rejected. A Type II error occurs when a false null
hypothesis is not rejected.


The probabilities of these errors are denoted by the Greek letters α and β,
for a Type I and a Type II error respectively. The power of the test, 1 − β,
quantifies the likelihood that a test will yield the correct result of a true
alternative hypothesis being accepted. A high power is desirable.
Exercise: 


Problem:


The mean price of mid-sized cars in a region is $32,000. A test is
conducted to see if the claim is true. State the Type I and Type II errors
in complete sentences.


Solution: 


Type I: The mean price of mid-sized cars is $32,000, but we conclude 
that it is not $32,000. 


Type II: The mean price of mid-sized cars is not $32,000, but we 
conclude that it is $32,000. 
Exercise: 
Problem: 
A sleeping bag is tested to withstand temperatures of −15 °F. You think
the bag cannot stand temperatures that low. State the Type I and Type
II errors in complete sentences.


Exercise: 


Problem: For Exercise 9.12, what are α and β in words?


Solution: 


α = the probability that you think the bag cannot withstand −15 degrees
F, when in fact it can


β = the probability that you think the bag can withstand −15 degrees F,
when in fact it cannot


Exercise: 


Problem: In words, describe 1 − β for Exercise 9.12.
Exercise: 
Problem: 
A group of doctors is deciding whether or not to perform an operation.
Suppose the null hypothesis, H0, is: the surgical procedure will go
well. State the Type I and Type II errors in complete sentences.


Solution:


Type I: The procedure will go well, but the doctors think it will not.


Type II: The procedure will not go well, but the doctors think it will.
Exercise: 

Problem: 

A group of doctors is deciding whether or not to perform an operation.
Suppose the null hypothesis, H0, is: the surgical procedure will go
well. Which is the error with the greater consequence?


Exercise: 


Problem: 
The power of a test is 0.981. What is the probability of a Type II error? 
Solution: 


0.019 


Exercise: 


Problem: 


A group of divers is exploring an old sunken ship. Suppose the null
hypothesis, H0, is: the sunken ship does not contain buried treasure.
State the Type I and Type II errors in complete sentences.


Exercise: 
Problem: 
A microbiologist is testing a water sample for E-coli. Suppose the null
hypothesis, H0, is: the sample does not contain E-coli. The probability
that the sample does not contain E-coli, but the microbiologist thinks it
does is 0.012. The probability that the sample does contain E-coli, but
the microbiologist thinks it does not is 0.002. What is the power of this
test?


Solution: 


0.998 
Exercise: 


Problem: 

A microbiologist is testing a water sample for E-coli. Suppose the null
hypothesis, H0, is: the sample contains E-coli. Which is the error with
the greater consequence?


Homework 


Exercise: 


Problem: 


State the Type I and Type II errors in complete sentences given the 
following statements. 


a. The mean number of years Americans work before retiring is 34.
b. At most 60% of Americans vote in presidential elections.
c. The mean starting salary for San Jose State University graduates
is at least $100,000 per year.
d. Twenty-nine percent of high school seniors get drunk each month.
e. Fewer than 5% of adults ride the bus to work in Los Angeles.
f. The mean number of cars a person owns in his or her lifetime is
not more than ten.
g. About half of Americans prefer to live away from cities, given the
choice.
h. Europeans have a mean paid vacation each year of six weeks.
i. The chance of developing breast cancer is under 11% for women.
j. Private universities' mean tuition cost is more than $20,000 per
year.


Solution: 


a. Type I error: We conclude that the mean is not 34 years, when it
really is 34 years. Type II error: We conclude that the mean is 34
years, when in fact it really is not 34 years.

b. Type I error: We conclude that more than 60% of Americans vote
in presidential elections, when the actual percentage is at most
60%. Type II error: We conclude that at most 60% of Americans
vote in presidential elections when, in fact, more than 60% do.

c. Type I error: We conclude that the mean starting salary is less
than $100,000, when it really is at least $100,000. Type II error:
We conclude that the mean starting salary is at least $100,000
when, in fact, it is less than $100,000.

d. Type I error: We conclude that the proportion of high school
seniors who get drunk each month is not 29%, when it really is
29%. Type II error: We conclude that the proportion of high
school seniors who get drunk each month is 29% when, in fact, it
is not 29%.

e. Type I error: We conclude that fewer than 5% of adults ride the
bus to work in Los Angeles, when the percentage that do is really
5% or more. Type II error: We conclude that 5% or more adults
ride the bus to work in Los Angeles when, in fact, fewer than 5%
do.

f. Type I error: We conclude that the mean number of cars a person
owns in his or her lifetime is more than 10, when in reality it is
not more than 10. Type II error: We conclude that the mean
number of cars a person owns in his or her lifetime is not more
than 10 when, in fact, it is more than 10.

g. Type I error: We conclude that the proportion of Americans who
prefer to live away from cities is not about half, though the actual
proportion is about half. Type II error: We conclude that the
proportion of Americans who prefer to live away from cities is
half when, in fact, it is not half.

h. Type I error: We conclude that the duration of paid vacations each
year for Europeans is not six weeks, when in fact it is six weeks.
Type II error: We conclude that the duration of paid vacations
each year for Europeans is six weeks when, in fact, it is not.

i. Type I error: We conclude that the proportion is less than 11%,
when it is really at least 11%. Type II error: We conclude that the
proportion of women who develop breast cancer is at least 11%,
when in fact it is less than 11%.

j. Type I error: We conclude that the average tuition cost at private
universities is more than $20,000, though in reality it is at most
$20,000. Type II error: We conclude that the average tuition cost
at private universities is at most $20,000 when, in fact, it is more
than $20,000.


Exercise:


Problem: 


For statements a-j in Exercise 9.109, answer the following in complete 
sentences. 


a. State a consequence of committing a Type I error. 
b. State a consequence of committing a Type II error. 


Exercise: 


Problem: 


When a new drug is created, the pharmaceutical company must subject 
it to testing before receiving the necessary permission from the Food 
and Drug Administration (FDA) to market the drug. Suppose the null 
hypothesis is “the drug is unsafe.” What is the Type II Error? 


a. To conclude the drug is safe when, in fact, it is unsafe.

b. Not to conclude the drug is safe when, in fact, it is safe. 

c. To conclude the drug is safe when, in fact, it is safe. 

d. Not to conclude the drug is unsafe when, in fact, it is unsafe. 


Solution: 


b 
Exercise: 


Problem: 


A statistics instructor believes that fewer than 20% of Evergreen 
Valley College (EVC) students attended the opening midnight showing 
of the latest Harry Potter movie. She surveys 84 of her students and 
finds that 11 of them attended the midnight showing. The Type I error 
is to conclude that the percent of EVC students who attended is 


a. at least 20%, when in fact, it is less than 20%. 
b. 20%, when in fact, it is 20%. 

c. less than 20%, when in fact, it is at least 20%. 
d. less than 20%, when in fact, it is less than 20%. 


Exercise: 


Problem: 


It is believed that Lake Tahoe Community College (LTCC) 
Intermediate Algebra students get less than seven hours of sleep per 
night, on average. A survey of 22 LTCC Intermediate Algebra students 
generated a mean of 7.24 hours with a standard deviation of 1.93 
hours. At a level of significance of 5%, do LTCC Intermediate Algebra 
students get less than seven hours of sleep per night, on average? 


The Type II error is not to reject that the mean number of hours of 
sleep LTCC students get per night is at least seven when, in fact, the 
mean number of hours 


a. is more than seven hours. 
b. is at most seven hours. 

c. is at least seven hours. 

d. is less than seven hours. 


Solution: 


d 
Exercise: 


Problem: 


Previously, an organization reported that teenagers spent 4.5 hours per 
week, on average, on the phone. The organization thinks that, 
currently, the mean is higher. Fifteen randomly chosen teenagers were 
asked how many hours per week they spend on the phone. The sample 
mean was 4.75 hours with a sample standard deviation of 2.0. Conduct 
a hypothesis test. The Type I error is:


a. to conclude that the current mean hours per week is higher than 
4.5, when in fact, it is higher 

b. to conclude that the current mean hours per week is higher than 
4.5, when in fact, it is the same 


c. to conclude that the mean hours per week currently is 4.5, when 
in fact, it is higher 

d. to conclude that the mean hours per week currently is no higher 
than 4.5, when in fact, it is not higher 


Glossary 


Type I Error 
The decision is to reject the null hypothesis when, in fact, the null 
hypothesis is true. 


Type II Error 
The decision is not to reject the null hypothesis when, in fact, the null 
hypothesis is false. 


Distribution Needed for Hypothesis Testing 


Earlier, we discussed sampling distributions. Particular distributions are
associated with hypothesis testing. We will perform hypothesis tests of a
population mean using a normal distribution or a Student's t-distribution.
(Remember, use a Student's t-distribution when the population standard
deviation is unknown and the sample size is small, where small is
considered to be less than 30 observations.) We perform tests of a
population proportion using a normal distribution when we can assume that
the sampling distribution is approximately normal. We consider this to be
true if the sample proportion, p′, times the sample size is greater than 5
and 1 − p′ times the sample size is also greater than 5. This is the same
rule of thumb we used when developing the formula for the confidence
interval for a population proportion.
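The rule of thumb for proportions is easy to check before running a test. Here is a minimal sketch; the function name is hypothetical, and the first example reuses the counts from the midnight showing exercise (11 of 84 students), while the second uses made-up counts.

```python
def normal_approximation_ok(successes, n):
    """Apply the rule of thumb: n*p' > 5 and n*(1 - p') > 5."""
    p_prime = successes / n
    return n * p_prime > 5 and n * (1 - p_prime) > 5


# 11 of 84 surveyed students attended the midnight showing.
print(normal_approximation_ok(11, 84))  # True
# A hypothetical small sample with only 2 successes out of 20.
print(normal_approximation_ok(2, 20))   # False
```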


Hypothesis Test for the Mean 


Going back to the standardizing formula we can derive the test statistic for
testing hypotheses concerning means.
Equation:


\[ Z_c = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \]


The standardizing formula cannot be solved as it is because we do not have
μ, the population mean. However, if we substitute in the hypothesized value
of the mean, μ0, in the formula as above, we can compute a Z value. This is
the test statistic for a test of hypothesis for a mean and is presented in [link].
We interpret this Z value as the associated probability that a sample with a
sample mean of X̄ could have come from a distribution with the population
mean hypothesized in H0, and we call this Z value Zc for "calculated".
[link] and [link] show this process.
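As a quick illustration of the test statistic, the following sketch evaluates the standardizing formula with the hypothesized mean substituted in. The numbers (sample mean 103, hypothesized mean 100, σ = 15, n = 36) and the function name are hypothetical.

```python
from math import sqrt


def z_test_statistic(sample_mean, mu0, sigma, n):
    """Return Zc = (sample mean - hypothesized mean) / (sigma / sqrt(n))."""
    return (sample_mean - mu0) / (sigma / sqrt(n))


# Hypothetical numbers: sample mean 103, H0: mu = 100, sigma = 15, n = 36.
z_c = z_test_statistic(103.0, 100.0, 15.0, 36)
print(round(z_c, 3))  # 1.2
```

Here the sample mean lies 1.2 standard errors above the hypothesized mean.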


In [link] two of the three possible outcomes are presented. X̄1 and X̄3 are in
the tails of the hypothesized distribution of H0. Notice that the horizontal
axis in the top panel is labeled X̄'s. This is the same theoretical distribution
of X̄'s, the sampling distribution, that the Central Limit Theorem tells us is
normally distributed. This is why we can draw it with this shape. The
horizontal axis of the bottom panel is labeled Z and is the standard normal
distribution. Zα/2 and −Zα/2, called the critical values, are marked on the
bottom panel as the Z values associated with the probability the analyst has
set as the level of significance in the test, (α). The probabilities in the tails
of both panels are, therefore, the same.


Notice that for each X̄ there is an associated Zc, called the calculated Z, that
comes from solving the equation above. This calculated Z is nothing more
than the number of standard deviations that the hypothesized mean is from
the sample mean. If the sample mean falls "too many" standard deviations
from the hypothesized mean we conclude that the sample mean could not
have come from the distribution with the hypothesized mean, given our
pre-set required level of significance. It could have come from H0, but it is
deemed just too unlikely. In [link] both X̄1 and X̄3 are in the tails of the
distribution. They are deemed "too far" from the hypothesized value of the
mean given the chosen level of alpha. If in fact this sample mean did
come from H0, but from out in the tail, we have made a Type I error: we have
rejected a good null. Our only real comfort is that we know the probability
of making such an error, α, and we can control the size of α.


[link] shows the third possibility for the location of the sample mean, x̄.
Here the sample mean is within the two critical values. That is, within the
probability of (1 − α) and we cannot reject the null hypothesis.


This gives us the decision rule for testing a hypothesis for a two-tailed test: 


Decision rule: two-tail test
If |Zc| < Zα/2 : then cannot REJECT H0


If |Zc| > Zα/2 : then cannot ACCEPT H0


This rule will always be the same no matter what hypothesis we are testing
or what formulas we are using to make the test. The only change will be to
change the Zc to the appropriate symbol for the test statistic for the
parameter being tested. Stating the decision rule another way: if the sample
mean is unlikely to have come from the distribution with the hypothesized
mean we cannot accept the null hypothesis. Here we define "unlikely" as
having a probability less than alpha of occurring.
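The two-tailed decision rule can be sketched in a few lines. This is an illustration only: the function name and the example test statistics are hypothetical, and the critical value is taken from the standard normal distribution with α/2 in each tail.

```python
from statistics import NormalDist


def two_tailed_decision(z_c, alpha=0.05):
    """Compare |Zc| with the critical value Z(alpha/2) and return the decision."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    if abs(z_c) > z_crit:
        return "cannot accept H0"
    return "cannot reject H0"


print(two_tailed_decision(1.2))   # cannot reject H0
print(two_tailed_decision(-2.5))  # cannot accept H0
```

Taking the absolute value of the calculated statistic lets one comparison cover both tails.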


P-Value Approach 


An alternative decision rule can be developed by calculating the probability
that a sample mean could be found that would give a test statistic larger
than the test statistic found from the current sample data assuming that the
null hypothesis is true. Here the notion of "likely" and "unlikely" is defined
by the probability of drawing a sample with a mean from a population with
the hypothesized mean that is either larger or smaller than that found in the
sample data. Simply stated, the p-value approach compares the desired
significance level, α, to the p-value which is the probability of drawing a
sample mean further from the hypothesized value than the actual sample
mean. A large p-value calculated from the data indicates that we should not
reject the null hypothesis. The smaller the p-value, the more unlikely the
outcome, and the stronger the evidence is against the null hypothesis. We
would reject the null hypothesis if the evidence is strongly against it. The
relationship between the decision rule of comparing the calculated test
statistic, Zc, and the critical value, Zα, and using the p-value can be seen
in [link].


The calculated value of the test statistic is Zc in this example and is marked
on the bottom graph of the standard normal distribution because it is a Z
value. In this case the calculated value is in the tail and thus we cannot
accept the null hypothesis; the associated X̄ is just too unusually large to
believe that it came from the distribution with a mean of μ0 with a
significance level of α.


If we use the p-value decision rule we need one more step. We need to find
in the standard normal table the probability associated with the calculated
test statistic, Zc. We then compare that to the α associated with our selected
level of confidence. In [link] we see that the p-value is less than α and
therefore we cannot accept the null. We know that the p-value is less than α
because the area under the p-value is smaller than α/2. It is important to
note that two researchers drawing randomly from the same population may
find two different p-values from their samples. This occurs because the
p-value is calculated as the probability in the tail beyond the sample mean
assuming that the null hypothesis is correct. Because the sample means will
in all likelihood be different this will create two different p-values.
Nevertheless, if the null hypothesis is true, their conclusions about it
should differ only with a small probability governed by α.
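The table lookup described above can be replaced by a direct calculation. The sketch below computes p-values from a calculated Z using the standard normal distribution; the function names and the example value Zc = 1.2 are hypothetical.

```python
from statistics import NormalDist


def two_tailed_p_value(z_c):
    """Return the probability of a test statistic at least as extreme as z_c."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z_c)))


def upper_tail_p_value(z_c):
    """Return the upper-tail probability beyond z_c (for Ha: mu > mu0)."""
    return 1.0 - NormalDist().cdf(z_c)


print(round(two_tailed_p_value(1.2), 4))
print(round(upper_tail_p_value(1.2), 4))
```

With Zc = 1.2 the two-tailed p-value is roughly 0.23, far above a typical α of 0.05, so the null hypothesis could not be rejected.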


Here is a systematic way to decide whether you cannot accept
or cannot reject a null hypothesis if using the p-value and a preset or
preconceived α (the "significance level"). A preset α is the probability of a
Type I error (rejecting the null hypothesis when the null hypothesis is true).
It may or may not be given to you at the beginning of the problem. In any
case, the value of α is the decision of the analyst. When you make a
decision to reject or not reject H0, do as follows:


• If α > p-value, cannot accept H0. The results of the sample data are
significant. There is sufficient evidence to conclude that H0 is an
incorrect belief and that the alternative hypothesis, Ha, may be
correct.

• If α < p-value, cannot reject H0. The results of the sample data are not
significant. There is not sufficient evidence to conclude that the
alternative hypothesis, Ha, may be correct. In this case the status quo
stands.

• When you "cannot reject H0", it does not mean that you should believe
that H0 is true. It simply means that the sample data have failed to
provide sufficient evidence to cast serious doubt about the truthfulness
of H0. Remember that the null is the status quo and it takes high
probability to overthrow the status quo. This bias in favor of the null
hypothesis is what gives rise to the statement "tyranny of the status
quo" when discussing hypothesis testing and the scientific method.


Both decision rules will result in the same decision and it is a matter of 
preference which one is used. 
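That the two decision rules agree can be checked directly. The following sketch applies both rules to a handful of hypothetical test statistics for a two-tailed test; the function names and the chosen values are assumptions for illustration.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def critical_value_rule(z_c, alpha):
    """Return True when |Zc| exceeds the critical value (cannot accept H0)."""
    return abs(z_c) > STANDARD_NORMAL.inv_cdf(1.0 - alpha / 2.0)


def p_value_rule(z_c, alpha):
    """Return True when the two-tailed p-value is below alpha (cannot accept H0)."""
    p_value = 2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(z_c)))
    return p_value < alpha


# Hypothetical test statistics, checked at alpha = 0.05.
for z_c in (-3.0, -1.5, 0.4, 1.9, 2.2):
    print(z_c, critical_value_rule(z_c, 0.05), p_value_rule(z_c, 0.05))
```

Each row prints the same True or False value twice, because the p-value falls below α exactly when the calculated Z lies beyond the critical value.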


One and Two-tailed Tests 


The discussion of [link]-[link] was based on the null and alternative
hypothesis presented in [link]. This was called a two-tailed test because the
alternative hypothesis allowed that the mean could have come from a
population which was either larger or smaller than the hypothesized mean
in the null hypothesis. This could be seen by the statement of the alternative
hypothesis as μ ≠ 100, in this example.


It may be that the analyst is concerned only about the value being "too"
high, or only about it being "too" low, relative to the hypothesized value.
If this is the case, it becomes a one-tailed test and all of the alpha
probability is placed in just one tail and not split into α/2 as in the above
case of a two-tailed test. Any test of a claim will be a one-tailed test. For
example, a car manufacturer claims that their Model 17B provides gas
mileage of greater than 25 miles per gallon. The null and alternative
hypotheses would be:


• H0: μ ≤ 25
• Ha: μ > 25


The claim would be in the alternative hypothesis. The burden of proof in
hypothesis testing is carried in the alternative. This is because rejecting
the null, the status quo, requires 90 or 95 percent confidence that it
cannot be maintained. Said another way, we want to have only a 5 or 10
percent probability of making a Type I error, rejecting a good null;
overthrowing the status quo.


Because this is a one-tailed test, all of the alpha probability is placed in
just one tail and is not split into α/2 as in a two-tailed test.


[link] shows the two possible cases and the form of the null and alternative 
hypothesis that give rise to them. 


[Figure: the two one-tailed cases, $H_a: \mu > \mu_0$ (right tail) and $H_a: \mu < \mu_0$ (left tail)] 


where $\mu_0$ is the hypothesized value of the population mean. 


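Sample size                Test statistic 
< 30 ($\sigma$ unknown)    $t_c = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ 
< 30 ($\sigma$ known)      $Z_c = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ 
> 30 ($\sigma$ unknown)    $Z_c = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ 
> 30 ($\sigma$ known)      $Z_c = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ 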


Test Statistics for Test of Means, Varying Sample Size, Population Standard 
Deviation Known or Unknown 


Effects of Sample Size on Test Statistic 


In developing the confidence intervals for the mean from a sample, we 
found that most often we would not have the population standard deviation, 
$\sigma$. If the sample size were less than 30, we could simply substitute the 
point estimate for $\sigma$, the sample standard deviation, s, and use the 
student's t distribution to correct for this lack of information. 


When testing hypotheses we are faced with this same problem and the 
solution is exactly the same. Namely: If the population standard deviation is 
unknown, and the sample size is less than 30, substitute s, the point estimate 
for the population standard deviation, $\sigma$, in the formula for the test 
statistic and use the student's t distribution. All the formulas and figures 
above are unchanged except for this substitution and changing the Z 
distribution to the student's t distribution on the graph. Remember that the 
student's t distribution can only be computed knowing the proper degrees of 
freedom for the problem. In this case, the degrees of freedom is computed as 
before with confidence intervals: df = (n-1). The calculated t-value is 
compared to the t-value associated with the pre-set level of confidence 
required in the test, $t_{\alpha, df}$, found in the student's t tables. If we 
do not know $\sigma$, but the sample size is 30 or more, we simply substitute s 
for $\sigma$ and use the normal distribution. 


[link] summarizes these rules. 
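

The rules above can be written as a short decision function. The following 
Python sketch is illustrative only: the function name choose_test_statistic is 
an assumption made here, not something defined in the text, and it treats a 
sample size of exactly 30 as large, as the paragraph above does. 

```python
def choose_test_statistic(n: int, sigma_known: bool) -> str:
    """Return "t" or "z" following the sample-size rules for a test of a mean.

    n           -- sample size
    sigma_known -- True when the population standard deviation is known
    """
    if n < 30 and not sigma_known:
        # Small sample and sigma estimated by s: Student's t with n - 1 df.
        return "t"
    # Sigma known, or sample size of 30 or more: normal distribution.
    return "z"


print(choose_test_statistic(15, sigma_known=False))  # small sample, s used
print(choose_test_statistic(35, sigma_known=False))  # large sample, s used
```

The function only selects the distribution; the formula for the test 
statistic itself is the same ratio in every row of the table, with s in place 
of the population standard deviation when the latter is unknown. 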


A Systematic Approach for Testing A Hypothesis 


A systematic approach to hypothesis testing follows these steps, in this 
order. This template will work for all hypotheses that you will ever test. 


1. Set up the null and alternative hypotheses. This is typically the hardest 
   part of the process. Here the question being asked is reviewed. What 
   parameter is being tested: a mean, a proportion, differences in means, 
   etc.? Is this a one-tailed test or a two-tailed test? Remember, if someone 
   is making a claim it will always be a one-tailed test. 

2. Decide the level of significance required for this particular case and 
   determine the critical value. These can be found in the appropriate 
   statistical table. The levels of confidence typical for the social sciences 
   are 90, 95 and 99 percent. However, the level of significance is a policy 
   decision and should be based upon the risk of making a Type I error, 
   rejecting a good null. Consider the consequences of making a Type I 
   error. 

   Next, on the basis of the hypotheses and sample size, select the 
   appropriate test statistic and find the relevant critical value: 
   $Z_\alpha$, $t_\alpha$, etc. Drawing the relevant probability distribution 
   and marking the critical value is always a big help. Be sure to match the 
   graph with the hypothesis, especially if it is a one-tailed test. 

3. Take a sample(s) and calculate the relevant parameters: sample mean, 
   standard deviation, or proportion. Using the formula for the test 
   statistic selected in step 2, now calculate the test statistic for this 
   particular case using the parameters you have just calculated. 

4. Compare the calculated test statistic and the critical value. Marking 
   these on the graph will give a good visual picture of the situation. 
   There are now only two situations: 

   a. The test statistic is in the tail: Cannot Accept the null. The 
      probability that this sample mean (proportion) came from the 
      hypothesized distribution is too small to believe that it is the real 
      home of these sample data. 

   b. The test statistic is not in the tail: Cannot Reject the null. The 
      sample data are compatible with the hypothesized population 
      parameter. 

5. Reach a conclusion. It is best to articulate the conclusion two different 
   ways. First, a formal statistical conclusion such as “With a 95% level 
   of confidence we cannot accept the null hypothesis that the 
   population mean is equal to XX (units of measurement)”. The second 
   statement of the conclusion is less formal and states the action, or lack 
   of action, required. If the formal conclusion was that above, then the 
   informal one might be, “The machine is broken and we need to shut it 
   down and call for repairs”. 


All hypotheses tested will go through this same process. The only changes 
are the relevant formulas and those are determined by the hypothesis 
required to answer the original question. 
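

The comparison in step 4 can be sketched in a few lines of Python. This is a 
minimal illustration under stated assumptions: the function name decide and 
its tail argument are hypothetical names chosen here, and the numbers in the 
sample calls are made up purely to exercise the three cases. 

```python
def decide(test_statistic: float, critical_value: float, tail: str) -> str:
    """Apply the step 4 decision rule.

    critical_value is taken as a positive table value (for example 1.96);
    tail is "left", "right", or "two", matching the alternative hypothesis.
    """
    c = abs(critical_value)
    if tail == "left":
        in_tail = test_statistic < -c
    elif tail == "right":
        in_tail = test_statistic > c
    elif tail == "two":
        in_tail = abs(test_statistic) > c
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    # In the tail: cannot accept the null. Otherwise: cannot reject it.
    return "cannot accept the null" if in_tail else "cannot reject the null"


# Illustrative values only (not taken from any example in the text).
print(decide(-2.3, 1.645, "left"))
print(decide(0.4, 1.96, "two"))
```

Keeping the tail as an explicit argument mirrors the advice in step 2 to match 
the graph with the hypothesis before looking at the data. 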


Chapter Review 


In order for a hypothesis test’s results to be generalized to a population, 
certain requirements must be satisfied. 


When testing for a single population mean: 


1. A Student's t-test should be used if the data come from a simple, 
random sample and the population is approximately normally 
distributed, or the sample size is large, with an unknown standard 
deviation. 

2. The normal test will work if the data come from a simple, random 
sample and the population is approximately normally distributed, or 
the sample size is large. 


When testing a single population proportion, use a normal test for a single 
population proportion if the data come from a simple, random sample, meet 
the requirements for a binomial distribution, and the mean number of 
successes and the mean number of failures satisfy the conditions: np > 5 and 
nq > 5, where n is the sample size, p is the probability of a success, and q is 
the probability of a failure. 


Formula Review 


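Sample size                Test statistic 
< 30 ($\sigma$ unknown)    $t_c = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ 
< 30 ($\sigma$ known)      $Z_c = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ 
> 30 ($\sigma$ unknown)    $Z_c = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ 
> 30 ($\sigma$ known)      $Z_c = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ 


Test Statistics for Test of Means, Varying Sample Size, Population Standard 
Deviation Known or Unknown 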
Exercise: 

Problem: 


Which two distributions can you use for hypothesis testing for this 
chapter? 


Solution: 


A normal distribution or a Student’s t-distribution 
Exercise: 
Problem: 
Which distribution do you use when you are testing a population mean 


and the population standard deviation is known? Assume sample size 
is large. Assume a normal distribution with n = 30. 


Exercise: 


Problem: 


Which distribution do you use when the standard deviation is not 
known and you are testing one population mean? Assume a normal 
distribution, with n > 30. 


Solution: 


Use a Student’s t-distribution 
Exercise: 
Problem: 
A population mean is 13. The sample mean is 12.8, and the sample 
standard deviation is two. The sample size is 20. What distribution 


should you use to perform a hypothesis test? Assume the underlying 
population is normal. 


Exercise: 
Problem: 
A population has a mean of 25 and a standard deviation of five. The 
sample mean is 24, and the sample size is 108. What distribution 
should you use to perform a hypothesis test? 


Solution: 


a normal distribution for a single population mean 
Exercise: 
Problem: 
It is thought that 42% of respondents in a taste test would prefer Brand 


A. In a particular test of 100 people, 39% preferred Brand A. What 
distribution should you use to perform a hypothesis test? 


Exercise: 


Problem: 


You are performing a hypothesis test of a single population mean using 
a Student’s t-distribution. What must you assume about the distribution 
of the data? 


Solution: 


It must be approximately normally distributed. 
Exercise: 
Problem: 
You are performing a hypothesis test of a single population mean using 


a Student’s t-distribution. The data are not from a simple random 
sample. Can you accurately perform the hypothesis test? 


Exercise: 
Problem: 


You are performing a hypothesis test of a single population proportion. 
What must be true about the quantities of np and nq? 


Solution: 


They must both be greater than five. 
Exercise: 
Problem: 
You are performing a hypothesis test of a single population proportion. 


You find out that np is less than five. What must you do to be able to 
perform a valid hypothesis test? 


Exercise: 


Problem: 


You are performing a hypothesis test of a single population proportion. 
The data come from which distribution? 


Solution: 


binomial distribution 


Homework 


Exercise: 


Problem: 


It is believed that Lake Tahoe Community College (LTCC) 
Intermediate Algebra students get less than seven hours of sleep per 
night, on average. A survey of 22 LTCC Intermediate Algebra students 
generated a mean of 7.24 hours with a standard deviation of 1.93 
hours. At a level of significance of 5%, do LTCC Intermediate Algebra 
students get less than seven hours of sleep per night, on average? The 
distribution to be used for this test is $\bar{X} \sim$ ________ 


a. $N(7.24, \frac{1.93}{\sqrt{22}})$ 
b. $N(7.24, 1.93)$ 
c. $t_{22}$ 
d. $t_{21}$ 
Solution: 
d 
Glossary 


Binomial Distribution 


a discrete random variable (RV) that arises from Bernoulli trials. There 
are a fixed number, n, of independent trials. “Independent” means that 
the result of any trial (for example, trial 1) does not affect the results of 
the following trials, and all trials are conducted under the same 
conditions. Under these circumstances the binomial RV X is defined as 
the number of successes in n trials. The notation is: $X \sim B(n, p)$. The 
mean is $\mu = np$ and the standard deviation is $\sigma = \sqrt{npq}$. The 
probability of exactly x successes in n trials is 
$P(X = x) = \binom{n}{x} p^x q^{n-x}$. 


Normal Distribution 


a continuous random variable (RV) with pdf 
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, 
where $\mu$ is the mean of the distribution, and $\sigma$ is the standard 
deviation; notation: $X \sim N(\mu, \sigma)$. If $\mu = 0$ and $\sigma = 1$, 
the RV is called the standard normal distribution. 


Standard Deviation 
a number that is equal to the square root of the variance and measures 
how far data values are from their mean; notation: s for sample 
standard deviation and $\sigma$ for population standard deviation. 


Student's t-Distribution 
investigated and reported by William S. Gosset in 1908 and published 
under the pseudonym Student. The major characteristics of the random 
variable (RV) are: 


• It is continuous and assumes any real values. 

• The pdf is symmetrical about its mean of zero. However, it is 
more spread out and flatter at the apex than the normal 
distribution. 

• It approaches the standard normal distribution as n gets larger. 

• There is a "family" of t distributions: every representative of the 
family is completely defined by the number of degrees of 
freedom, which is one less than the number of data items. 


Test Statistic 


The formula that counts the number of standard deviations on the 
relevant distribution that the estimated parameter is away from the 
hypothesized value. 


Critical Value 
The t or Z value set by the researcher that measures the probability of a 
Type I error, $\alpha$. 


Full Hypothesis Test Examples 


Tests on Means 


Example: 
Exercise: 


Problem: 


Jeffrey, as an eight-year old, established a mean time of 16.43 
seconds for swimming the 25-yard freestyle, with a standard 
deviation of 0.8 seconds. His dad, Frank, thought that Jeffrey could 
swim the 25-yard freestyle faster using goggles. Frank bought Jeffrey 
a new pair of expensive goggles and timed Jeffrey for 15 25-yard 
freestyle swims. For the 15 swims, Jeffrey's mean time was 16 
seconds. Frank thought that the goggles helped Jeffrey to swim 
faster than the 16.43 seconds. Conduct a hypothesis test using a 
preset $\alpha = 0.05$. 


Solution: 
Set up the Hypothesis Test: 


Since the problem is about a mean, this is a test of a single 
population mean. 


Set the null and alternative hypothesis: 


In this case there is an implied challenge or claim. This is that the 
goggles will reduce the swimming time. The effect of this is to set the 
hypothesis as a one-tailed test. The claim will always be in the 
alternative hypothesis because the burden of proof always lies with 
the alternative. Remember that the status quo must be defeated with a 
high degree of confidence, in this case 95 % confidence. The null and 
alternative hypotheses are thus: 


$H_0: \mu \ge 16.43$    $H_a: \mu < 16.43$ 


For Jeffrey to swim faster, his time will be less than 16.43 seconds. 
The "<" tells you this is left-tailed. 


Determine the distribution needed: 
Random variable: $\bar{X}$ = the mean time to swim the 25-yard freestyle. 
Distribution for the test statistic: 


The sample size is less than 30 and we do not know the population 
standard deviation, so this is a t-test, and the proper formula is: 
$t_c = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ 


$\mu_0 = 16.43$ comes from $H_0$ and not the data. $\bar{x} = 16$, $s = 0.8$, 
and $n = 15$. 


Our step 2, setting the level of significance, has already been 
determined by the problem: 0.05, for a 95% level of confidence. It is 
worth thinking about the meaning of this choice. The Type I error is to 
conclude that Jeffrey swims the 25-yard freestyle, on average, in less 
than 16.43 seconds when, in fact, he actually swims the 25-yard 
freestyle, on average, in 16.43 seconds. (Reject the null hypothesis 
when the null hypothesis is true.) For this case the only concern with a 
Type I error would seem to be that Jeffrey’s dad may fail to bet on his 
son’s victory because he does not have appropriate confidence in the 
effect of the goggles. 


To find the critical value we need to select the appropriate test 
statistic. We have concluded that this is a t-test on the basis of the 
sample size and that we are interested in a population mean. We can 
now draw the graph of the t-distribution and mark the critical value. 
For this problem the degrees of freedom are n-1, or 14. Looking up 14 
degrees of freedom at the 0.05 column of the t-table we find 1.761. 
This is the critical value and we can put this on our graph. 


Step 3 is the calculation of the test statistic using the formula we have 
selected. We find that the calculated test statistic is -2.08, meaning that 
the sample mean is 2.08 standard deviations below the 
hypothesized mean of 16.43. 

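Equation: 


$t_c = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{16 - 16.43}{0.8/\sqrt{15}} = -2.08$ 


[Figure: t-distribution under $H_0: \mu = 16.43$, with the critical value -1.761 and the test statistic -2.08 marked in the left tail] 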


Step 4 has us compare the test statistic and the critical value and mark these 
on the graph. We see that the test statistic is in the tail and thus we move to 
step 5 and reach a conclusion. The probability that an average time of 16 
seconds could come from a distribution with a population mean of 16.43 
seconds is too small for us to accept the null hypothesis. We cannot 
accept the null. 


Step 5 has us state our conclusions first formally and then less formally. A 
formal conclusion would be stated as: “With a 95% level of confidence we 
cannot accept the null hypothesis that the swimming time with goggles 
comes from a distribution with a population mean time of 16.43 seconds.” 
Less formally, “With 95% confidence we believe that the goggles 
improve swimming speed.” 


If we wished to use the p-value system of reaching a conclusion we would 
calculate the statistic and take the additional step to find the probability of 
being 2.08 standard deviations below the mean on a t-distribution with 14 
degrees of freedom. This value is approximately 0.028. Comparing this to the 
$\alpha$-level of 0.05 we see that we cannot accept the null. The p-value has 
been put on the graph as the shaded area beyond -2.08 and it shows that it is 
smaller than the hatched area which is the alpha level of 0.05. Both methods 
reach the same conclusion that we cannot accept the null hypothesis. 
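

The arithmetic for the test statistic can be checked with a few lines of 
Python. This is only a sketch of the calculation described above; the variable 
names are chosen here for illustration, and the critical value is typed in 
from the t-table (14 degrees of freedom, 0.05 column) rather than computed. 

```python
import math

x_bar = 16.0    # sample mean of the 15 timed swims, in seconds
mu_0 = 16.43    # hypothesized mean from the null hypothesis, in seconds
s = 0.8         # standard deviation, in seconds
n = 15          # number of swims in the sample

# Test statistic: distance of the sample mean from mu_0 in standard errors.
t_c = (x_bar - mu_0) / (s / math.sqrt(n))

# Left-tailed test, so the table value 1.761 is placed in the lower tail.
t_critical = -1.761
reject_null = t_c < t_critical

print(round(t_c, 2))   # -2.08
print(reject_null)     # True
```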


Note: 
Try It 
Exercise: 


Problem: 


The mean throwing distance of a football for Marco, a high school 
freshman quarterback, is 40 yards, with a standard deviation of two 
yards. The team coach tells Marco to adjust his grip to get more 
distance. The coach records the distances for 20 throws. For the 20 
throws, Marco’s mean distance was 45 yards. The coach thought the 
different grip helped Marco throw farther than 40 yards. Conduct a 
hypothesis test using a preset $\alpha = 0.05$. Assume the throw distances 
for footballs are normal. 


First, determine what type of test this is, set up the hypothesis test, 
find the p-value, sketch the graph, and state your conclusion. 


Solution: 


Since the problem is about a mean, this is a test of a single population 
mean. 


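$H_0: \mu = 40$ 
$H_a: \mu > 40$ 


p = 0.0062 


[Figure: normal curve centered at 40, with the p-value shaded in the right tail beyond the sample mean of 45] 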


Because p < a, we reject the null hypothesis. There is sufficient 
evidence to suggest that the change in grip improved Marco’s 
throwing distance. 


Example: 
Exercise: 


Problem: 


Jane has just begun her new job on the sales force of a very 
competitive company. In a sample of 16 sales calls it was found that 
she closed the contract for an average value of 108 dollars with a 
standard deviation of 12 dollars. Test at 5% significance the null 
hypothesis that the population mean is at most 100 dollars against the 
alternative that it is greater than 100 dollars. Company policy requires 
that new members of the sales force must exceed an average of $100 per 
contract during the trial employment period. Can we conclude that Jane 
has met this requirement at the 95% level of confidence? 


Solution: 


1. $H_0: \mu \le 100$ 
   $H_a: \mu > 100$ 
   The null and alternative hypotheses are for the parameter $\mu$ 
   because the number of dollars of the contracts is a continuous 
   random variable. Also, this is a one-tailed test because the 
   company is interested only in whether the number of dollars per 
   contract is below a particular number, not in whether it is "too 
   high." This can be thought of as making a claim that the 
   requirement is being met, and thus the claim is in the alternative 
   hypothesis. 

2. Test statistic: $t_c = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{108 - 100}{12/\sqrt{16}} = 2.667$ 

3. Critical value: $t_\alpha = 1.753$ with $n - 1 = 15$ degrees of freedom 


The test statistic is a Student's t because the sample size is below 30; 
therefore, we cannot use the normal distribution. Comparing the 
calculated value of the test statistic and the critical value of t 
($t_\alpha$) at a 5% significance level, we see that the calculated value is 
in the tail of the distribution. Thus, we conclude that 108 dollars per 
contract is significantly larger than the hypothesized value of 100 and thus 
we cannot accept the null hypothesis. There is evidence that 
Jane's performance meets company standards. 
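

As a quick check on the numbers in this example, the same calculation can be 
sketched in Python. The variable names are illustrative assumptions, and the 
critical value 1.753 is copied from the solution rather than computed. 

```python
import math

x_bar = 108.0   # average contract value in the sample, in dollars
mu_0 = 100.0    # value the population mean must exceed
s = 12.0        # sample standard deviation, in dollars
n = 16          # number of sales calls

t_c = (x_bar - mu_0) / (s / math.sqrt(n))

# Right-tailed test: the claim (mean above 100) sits in the alternative.
t_critical = 1.753          # t-table, 15 degrees of freedom, 0.05 column
in_tail = t_c > t_critical

print(round(t_c, 3))   # 2.667
print(in_tail)         # True
```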


Note: 
Try It 


Exercise: 


Problem: 


It is believed that a stock price for a particular company will grow at a 
rate of $5 per week with a standard deviation of $1. An investor 
believes the stock won’t grow as quickly. The changes in stock price 
are recorded for ten weeks and are as follows: $4, $3, $2, $3, $1, $7, 
$2, $1, $1, $2. Perform a hypothesis test using a 5% level of 
significance. State the null and alternative hypotheses, state your 
conclusion, and identify the Type I and Type II errors. 


Solution: 
$H_0: \mu = 5$ 
$H_a: \mu < 5$ 
p = 0.0082 


Because p < a, we reject the null hypothesis. There is sufficient 
evidence to suggest that the stock price of the company grows at a 
rate less than $5 a week. 


Type I Error: To conclude that the stock price is growing slower than 
$5 a week when, in fact, the stock price is growing at $5 a week 
(reject the null hypothesis when the null hypothesis is true). 


Type II Error: To conclude that the stock price is growing at a rate of 
$5 a week when, in fact, the stock price is growing slower than $5 a 
week (do not reject the null hypothesis when the null hypothesis is 
false). 


Example: 
Exercise: 


Problem: 


A manufacturer of salad dressings uses machines to dispense liquid 
ingredients into bottles that move along a filling line. The machine 
that dispenses salad dressings is working properly when 8 ounces are 
dispensed. Suppose that the average amount dispensed in a particular 
sample of 35 bottles is 7.91 ounces with a variance of 0.03 ounces 
squared, $s^2$. Is there evidence that the machine should be stopped and 
production wait for repairs? The lost production from a shutdown is 
potentially so great that management feels that the level of 
confidence in the analysis should be 99%. 


Again we will follow the steps in our analysis of this problem. 
Solution: 


STEP 1: Set the Null and Alternative Hypothesis. The random 
variable is the quantity of fluid placed in the bottles. This is a 
continuous random variable and the parameter we are interested in is 
the mean. Our hypothesis therefore is about the mean. In this case we 
are concerned that the machine is not filling properly. From what we 
are told it does not matter if the machine is over-filling or 
under-filling; both seem to be an equally bad error. This tells us that 
this is a two-tailed test: if the machine is malfunctioning it will be shut 
down regardless of whether it is over-filling or under-filling. The null and 
alternative hypotheses are thus: 

Equation: 


$H_0: \mu = 8$ 
Equation: 
$H_a: \mu \neq 8$ 


STEP 2: Decide the level of significance and draw the graph showing 
the critical value. 


This problem has already set the level of confidence at 99%, a level of 
significance of 0.01. The decision seems an appropriate one and shows the 
thought process when setting the significance level. Management wants to be 
very certain, as certain as probability will allow, that they are not shutting 
down a machine that is not in need of repair. To draw the distribution 
and the critical value, we need to know which distribution to use. 
Because this is a continuous random variable and we are interested in 
the mean, and the sample size is greater than 30, the appropriate 
distribution is the normal distribution and the relevant critical value is 
2.575 from the normal table or the t-table at the 0.005 column and infinite 
degrees of freedom. We draw the graph and mark these points. 


[Figure: normal curve with $\alpha/2 = 0.005$ in each tail, critical values at -2.575 and 2.575, and the test statistic $Z_c = -3.07$ in the left tail] 


STEP 3: Calculate sample parameters and the test statistic. The 
sample parameters are provided, the sample mean is 7.91 and the 
sample variance is .03 and the sample size is 35. We need to note that 
the sample variance was provided not the sample standard deviation, 
which is what we need for the formula. Remembering that the 
standard deviation is simply the square root of the variance, we 
therefore know the sample standard deviation, s, is 0.173. With this 
information we calculate the test statistic as -3.07, and mark it on the 
graph. 

Equation: 

$Z_c = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{7.91 - 8}{0.173/\sqrt{35}} = -3.07$ 

STEP 4: Compare test statistic and the critical values. Now we 
compare the test statistic and the critical value by placing the test 
statistic on the graph. We see that the test statistic is in the tail, 
decidedly beyond the critical value of -2.575. We note that even 
the very small difference between the hypothesized value and the 
sample value is still a large number of standard deviations. The 
sample mean is only 0.09 ounces different from the required level of 8 
ounces, but it is 3 plus standard deviations away and thus we cannot 
accept the null hypothesis. 


STEP 5: Reach a Conclusion 


A test statistic more than three standard deviations from the hypothesized 
mean will lead to rejection of the null at any commonly used level of 
significance. The probability that a value falls more than three standard 
deviations from the mean is almost zero. Actually it is 0.0027 on the normal 
distribution, which is certainly almost zero in a practical sense. Our 
formal conclusion would be “At a 99% level of confidence we 
cannot accept the hypothesis that the sample mean came from a 
distribution with a mean of 8 ounces.” Or less formally, and getting to 
the point, “At a 99% level of confidence we conclude that the 
machine is under-filling the bottles and is in need of repair”. 
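

The steps of this example can be reproduced with Python's standard library. 
The sketch below is an illustration, not part of the original solution: the 
variable names are assumptions, and the critical value is computed with 
statistics.NormalDist instead of being read from the normal table, so it 
comes out as 2.576 instead of the table's 2.575. 

```python
import math
from statistics import NormalDist

x_bar = 7.91       # sample mean, in ounces
mu_0 = 8.0         # hypothesized mean, in ounces
variance = 0.03    # sample variance, in ounces squared
n = 35             # number of bottles sampled
alpha = 0.01       # 99% level of confidence

s = math.sqrt(variance)                     # sample standard deviation
z_c = (x_bar - mu_0) / (s / math.sqrt(n))   # test statistic

# Two-tailed test: alpha is split evenly between the two tails.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)
reject_null = abs(z_c) > z_critical

print(round(s, 3))            # 0.173
print(round(z_c, 2))          # -3.07
print(round(z_critical, 3))   # 2.576
print(reject_null)            # True
```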


Hypothesis Test for Proportions 


Just as there were confidence intervals for proportions, or more formally, 
the population parameter p of the binomial distribution, there is the ability 
to test hypotheses concerning p. 


The population parameter for the binomial is p. The estimated value (point 
estimate) for p is p’ where p' = x/n, x is the number of successes in the 
sample and n is the sample size. 


When you perform a hypothesis test of a population proportion p, you take 
a simple random sample from the population. The conditions for a 
binomial distribution must be met, which are: there are a certain number n 
of independent trials, meaning random sampling; the outcomes of any trial 
are binary, success or failure; and each trial has the same probability of a 
success p. The shape of the binomial distribution needs to be similar to the 
shape of the normal distribution. To ensure this, the quantities np' and nq' 
must both be greater than five (np' > 5 and nq' > 5). In this case the 
binomial distribution of a sample (estimated) proportion can be 
approximated by the normal distribution with $\mu = np$ and 
$\sigma = \sqrt{npq}$. Remember that $q = 1 - p$. There is no distribution that 
can correct for this small sample bias and thus if these conditions are not 
met we simply cannot test the hypothesis with the data available at that 
time. We met this condition when we first were estimating confidence 
intervals for p. 


Again, we begin with the standardizing formula modified because this is the 
distribution of a binomial. 
Equation: 


$Z = \frac{p' - p}{\sqrt{\frac{pq}{n}}}$ 


Substituting $p_0$, the hypothesized value of p, we have: 
Equation: 


$Z_c = \frac{p' - p_0}{\sqrt{\frac{p_0 q_0}{n}}}$ 

This is the test statistic for testing hypothesized values of p, where the null 
and alternative hypotheses take one of the following forms: 


Two-tailed test        One-tailed test       One-tailed test 
$H_0: p = p_0$         $H_0: p \le p_0$      $H_0: p \ge p_0$ 
$H_a: p \neq p_0$      $H_a: p > p_0$        $H_a: p < p_0$ 


The decision rule stated above applies here also: if the calculated value of 
$Z_c$ shows that the sample proportion is "too many" standard deviations from 
the hypothesized proportion, the null hypothesis cannot be accepted. The 
decision as to what is "too many" is pre-determined by the analyst 
depending on the level of significance required in the test. 


Example: 
Exercise: 


Problem: 


The mortgage department of a large bank is interested in the nature of 
loans of first-time borrowers. This information will be used to tailor 
their marketing strategy. They believe that 50% of first-time 
borrowers take out smaller loans than other borrowers. They perform 
a hypothesis test to determine if the percentage is the same or 
different from 50%. They sample 100 first-time borrowers and find 
53 of these loans are smaller than those of other borrowers. For the 
hypothesis test, they choose a 5% level of significance. 


Solution: 

STEP 1: Set the null and alternative hypothesis. 

$H_0: p = 0.50$    $H_a: p \neq 0.50$ 

The words "is the same or different from" tell you this is a two-tailed 
test. The Type I and Type II errors are as follows: The Type I 
error is to conclude that the proportion of borrowers is different from 
50% when, in fact, the proportion is actually 50%. (Reject the null 
hypothesis when the null hypothesis is true.) The Type II error is to 
conclude that there is not enough evidence that the proportion of first-time 
borrowers differs from 50% when, in fact, the proportion does differ 
from 50%. (You fail to reject the null hypothesis when the null 
hypothesis is false.) 


STEP 2: Decide the level of significance and draw the graph showing 
the critical value 


The level of significance has been set by the problem at 5%, a 95% 
level of confidence. Because this is a two-tailed test, one-half of the alpha 
value will be in the upper tail and one-half in the lower tail as shown on the 
graph. The critical value for the normal distribution at the 95% level 
of confidence is 1.96. This can easily be found on the student’s t-table 
at the very bottom at infinite degrees of freedom, remembering that at 
infinity the t-distribution is the normal distribution. Of course the 
value can also be found on the normal table but you have to go looking 
for one-half of 0.95 (0.475) inside the body of the table and then read 
out to the sides and top for the number of standard deviations. 


STEP 3: Calculate the sample parameters and the test statistic. 


The test statistic is a normal distribution, Z, for testing proportions 
and is: 
Equation: 

$Z_c = \frac{p' - p_0}{\sqrt{\frac{p_0 q_0}{n}}}$ 

For this case, the sample of 100 found 53 first-time borrowers whose loans 
were smaller than those of other borrowers. The sample proportion is 
p’ = 53/100 = 0.53. The test question, therefore, is: “Is 0.53 significantly 
different from 0.50?” Putting these values into the formula for the test 
statistic we find that 0.53 is only 0.60 standard deviations away from 0.50. 
This is barely off of the mean of the standard normal distribution of zero. 
There is virtually no difference between the sample proportion and the 
hypothesized proportion in terms of standard deviations. 


STEP 4: Compare the test statistic and the critical value. 


The calculated value is well within the critical values of ± 1.96 
standard deviations and thus we cannot reject the null hypothesis. To 
reject the null hypothesis we need significant evidence of a difference 
between the hypothesized value and the sample value. In this case the 
sample value is very nearly the same as the hypothesized value 
measured in terms of standard deviations. 


STEP 5: Reach a conclusion 


The formal conclusion would be “At a 95% level of confidence we 
cannot reject the null hypothesis that 50% of first-time borrowers 
take out smaller loans than other borrowers”. Less formally we would 
say that “There is no evidence that the proportion of first-time 
borrowers who take out smaller loans differs from one-half”. Notice 
the length to which the conclusion goes to include all of the 
conditions that are attached to the conclusion. Statisticians, for all the 
criticism they receive, are careful to be very specific even when this 
seems trivial. Statisticians cannot say more than they know and the 
data constrain the conclusion to be within the metes and bounds of the 
data. 
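

The calculation in Steps 3 and 4 can be sketched in Python using only the 
standard library. The variable names are illustrative assumptions; the 
critical value is computed with statistics.NormalDist and agrees with the 
table value of 1.96 after rounding. 

```python
import math
from statistics import NormalDist

x = 53          # first-time borrowers in the sample with smaller loans
n = 100         # sample size
p_0 = 0.50      # hypothesized proportion
alpha = 0.05    # level of significance

p_prime = x / n                 # sample proportion p'
q_prime = 1 - p_prime

# Normal approximation condition: np' and nq' must both exceed five.
normal_ok = n * p_prime > 5 and n * q_prime > 5

# Test statistic for a proportion, using p_0 in the standard error.
z_c = (p_prime - p_0) / math.sqrt(p_0 * (1 - p_0) / n)

# Two-tailed test: half of alpha in each tail.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)
reject_null = abs(z_c) > z_critical

print(normal_ok)              # True
print(round(z_c, 2))          # 0.6
print(round(z_critical, 2))   # 1.96
print(reject_null)            # False
```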


Note: 
Try It 
Exercise: 


Problem: 


A teacher believes that 85% of students in the class will want to go on 
a field trip to the local zoo. She performs a hypothesis test to 
determine if the percentage is the same or different from 85%. The 
teacher samples 50 students and 39 reply that they would want to go 
to the zoo. For the hypothesis test, use a 1% level of significance. 


Solution: 


Since the problem is about percentages, this is a test of a single 
population proportion. 


$H_0: p = 0.85$ 
$H_a: p \neq 0.85$ 


p = 0.7554 


[Figure: normal curve with one-half of the p-value shaded in each tail] 


Because p > a, we fail to reject the null hypothesis. There is not 
sufficient evidence to suggest that the proportion of students that want 
to go to the zoo is not 85%. 


Example: 
Exercise: 


Problem: 


Suppose a consumer group suspects that the proportion of households 
that have three or more cell phones is 30%. A cell phone company has 
reason to believe that the proportion is not 30%. Before they start a 
big advertising campaign, they conduct a hypothesis test. Their 
marketing people survey 150 households with the result that 43 of the 
households have three or more cell phones. 


Solution: 


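Here is an abbreviated version of the system to solve hypothesis tests, 
applied to a test on a proportion. 


Equation: 

$H_0: p = 0.3$ 
Equation: 

$H_a: p \neq 0.3$ 
Equation: 

$n = 150$ 
Equation: 

$p' = \frac{x}{n} = \frac{43}{150} = 0.287$ 

Equation: 

$Z_c = \frac{p' - p_0}{\sqrt{\frac{p_0 q_0}{n}}} = \frac{0.287 - 0.3}{\sqrt{\frac{0.3(0.7)}{150}}} = -0.347$ 


[Figure: standard normal curve with $\alpha/2 = 0.05$ in each tail, critical values at -1.64 and 1.64, and the test statistic -0.347 between them] 


At a 90% level of confidence we cannot reject $H_0$: 
the data are consistent with the consumer group's figure of 30%. 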
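

The abbreviated solution can be checked with a short Python sketch. The 
variable names are assumptions made for illustration. Note that the solution 
rounds the sample proportion to 0.287 before computing the test statistic; the 
sketch computes the statistic both ways, and the decision is the same. 

```python
import math
from statistics import NormalDist

x, n = 43, 150    # households with three or more cell phones, sample size
p_0 = 0.3         # hypothesized proportion
alpha = 0.10      # two-tailed, so 0.05 in each tail

std_error = math.sqrt(p_0 * (1 - p_0) / n)

p_prime = x / n                            # unrounded sample proportion
z_unrounded = (p_prime - p_0) / std_error

p_rounded = round(p_prime, 3)              # 0.287, as used in the solution
z_rounded = (p_rounded - p_0) / std_error

z_critical = NormalDist().inv_cdf(1 - alpha / 2)
reject_null = abs(z_unrounded) > z_critical

print(round(z_rounded, 3))     # -0.347
print(round(z_unrounded, 3))   # -0.356
print(reject_null)             # False
```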


Example: 
Exercise: 


Problem: 


The National Institute of Standards and Technology provides exact 
data on conductivity properties of materials. Following are 
conductivity measurements for 11 randomly selected pieces of a 
particular type of glass. 


1.11; 1.07; 1.11; 1.07; 1.12; 1.08; .98; .98; 1.02; .95; .95 
Is there convincing evidence that the average conductivity of this type 
of glass is greater than one? Use a significance level of 0.05. 


Solution: 
Let’s follow a four-step process to answer this statistical question. 
1. State the Question: We need to determine if, at a 0.05
significance level, the average conductivity of the selected glass
is greater than one. Our hypotheses will be

a. $H_0: \mu \leq 1$
b. $H_a: \mu > 1$


2. Plan: We are testing a sample mean without a known population 
standard deviation with less than 30 observations. Therefore, we 
need to use a Student's-t distribution. Assume the underlying 
population is normal. 

3. Do the calculations and draw the graph. 

4. State the Conclusions: We cannot accept the null hypothesis. It
is reasonable to state that the data support the claim that the
average conductivity level is greater than one.
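
Step 3 can be carried out in a few lines. The sketch below assumes the eleven measurements read 1.11, 1.07, 1.11, 1.07, 1.12, 1.08, 0.98, 0.98, 1.02, 0.95, 0.95 (a reading of the data list in the problem), and it takes the right-tail critical value for 10 degrees of freedom at the 0.05 level, 1.812, from a standard Student's t table rather than computing it.

```python
from math import sqrt
from statistics import mean, stdev

# Assumed reading of the 11 conductivity measurements in the problem.
conductivity = [1.11, 1.07, 1.11, 1.07, 1.12, 1.08,
                0.98, 0.98, 1.02, 0.95, 0.95]
mu0 = 1.0                        # boundary value in H0: mu <= 1

n = len(conductivity)
x_bar = mean(conductivity)       # sample mean
s = stdev(conductivity)          # sample standard deviation (n - 1 divisor)
t_stat = (x_bar - mu0) / (s / sqrt(n))

t_crit = 1.812                   # t table: right-tail area 0.05, df = n - 1 = 10
reject_h0 = t_stat > t_crit      # right-tailed test
print(f"x-bar = {x_bar:.3f}, s = {s:.4f}, t = {t_stat:.3f}")
print("cannot accept H0" if reject_h0 else "cannot reject H0")
```

Because the test statistic exceeds the critical value, the script reaches the same decision as step 4.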


Example: 
Exercise: 


Problem: 


In a study of 420,019 cell phone users, 172 of the subjects developed 
brain cancer. Test the claim that cell phone users developed brain 
cancer at a greater rate than that for non-cell phone users (the rate of 
brain cancer for non-cell phone users is 0.0340%). Since this is a 
critical issue, use a 0.005 significance level. Explain why the 
significance level should be so low in terms of a Type I error. 


Solution: 


1. We need to conduct a hypothesis test on the claimed cancer rate.
Our hypotheses will be

a. $H_0: p \leq 0.00034$
b. $H_a: p > 0.00034$

If we commit a Type I error, we are essentially accepting a false
claim. Since the claim describes cancer-causing environments,
we want to minimize the chances of incorrectly identifying
causes of cancer.

2. We will be testing a sample proportion with x = 172 and n =
420,019. The sample is sufficiently large because we have np =
420,019(0.00034) = 142.8, nq = 420,019(0.99966) = 419,876.2,
two independent outcomes, and a fixed probability of success p
= 0.00034. Thus we will be able to generalize our results to the
population.
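
The sample-size check in step 2 and the calculation that would follow it can be sketched as below. This is an illustration rather than the text's own remaining steps: the threshold of 5 expected successes and failures is the usual rule of thumb, and the right-tailed p-value is computed from the normal approximation.

```python
from math import sqrt
from statistics import NormalDist

x, n = 172, 420_019
p0 = 0.00034        # brain cancer rate for non-cell phone users
alpha = 0.005

# Step 2: sample-size check (rule of thumb: both counts above 5).
expected_successes = n * p0
expected_failures = n * (1 - p0)
large_enough = expected_successes > 5 and expected_failures > 5

# Calculation: right-tailed z-test for a single proportion.
p_prime = x / n
z = (p_prime - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 1 - NormalDist().cdf(z)

print(f"np = {expected_successes:.1f}, nq = {expected_failures:.1f}")
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "cannot reject H0")
```

With a significance level this strict, a p-value slightly above 0.005 is not enough to reject the null hypothesis, which illustrates how a low alpha guards against a Type I error.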


Chapter Review 


The hypothesis test itself has an established process. This can be 
summarized as follows: 


1. Determine $H_0$ and $H_a$. Remember, they are contradictory.

2. Determine the random variable. 

3. Determine the distribution for the test. 

4. Draw a graph and calculate the test statistic. 

5. Compare the calculated test statistic with the Z critical value
determined by the level of significance required by the test and make a
decision (cannot reject $H_0$ or cannot accept $H_0$), and write a clear
conclusion using English sentences.
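
Step 5 of the process above can be expressed as a small decision rule. The following is a sketch for tests that use the standard normal distribution; the function name `z_critical_decision` and the tail labels are hypothetical, chosen only for this illustration.

```python
from statistics import NormalDist


def z_critical_decision(z_stat, alpha, tail):
    """Compare a calculated z statistic with the critical value(s).

    tail is "left", "right", or "two", matching the alternative hypothesis.
    """
    std_normal = NormalDist()
    if tail == "left":
        reject = z_stat < std_normal.inv_cdf(alpha)
    elif tail == "right":
        reject = z_stat > std_normal.inv_cdf(1 - alpha)
    elif tail == "two":
        reject = abs(z_stat) > std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    return "cannot accept H0" if reject else "cannot reject H0"


# A left-tailed test with z = -2.315 at the 5% level.
print(z_critical_decision(-2.315, alpha=0.05, tail="left"))
# A two-tailed test with z = -0.347 at the 10% level.
print(z_critical_decision(-0.347, alpha=0.10, tail="two"))
```

The function only makes the decision; writing the conclusion in plain English in the context of the problem is still the final step.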


Exercise: 


Problem: 


Assume $H_0: \mu = 9$ and $H_a: \mu < 9$. Is this a left-tailed, right-tailed, or
two-tailed test? 


Solution: 


This is a left-tailed test. 


Exercise: 


Problem: 


Assume $H_0: \mu \leq 6$ and $H_a: \mu > 6$. Is this a left-tailed, right-tailed, or
two-tailed test? 


Exercise: 


Problem: 


Assume $H_0: p = 0.25$ and $H_a: p \neq 0.25$. Is this a left-tailed, right-tailed,
or two-tailed test? 


Solution: 


This is a two-tailed test. 


Exercise: 


Problem: Draw the general graph of a left-tailed test. 


Exercise: 


Problem: Draw the graph of a two-tailed test. 


Solution: 


[Figure: normal curve with both tails shaded; each shaded tail has area 1/2 (p-value).]


Exercise: 


Problem: 


A bottle of water is labeled as containing 16 fluid ounces of water. You 
believe it is less than that. What type of test would you use? 


Exercise: 
Problem: 


Your friend claims that his mean golf score is 63. You want to show 
that it is higher than that. What type of test would you use? 


Solution: 


a right-tailed test 
Exercise: 
Problem: 
A bathroom scale claims to be able to identify correctly any weight
within a pound. You think that it cannot be that accurate. What type of
test would you use?


Exercise: 
Problem: 
You flip a coin and record whether it shows heads or tails. You know
the probability of getting heads is 50%, but you think it is less for this
particular coin. What type of test would you use?


Solution: 


a left-tailed test 
Exercise: 
Problem: 
If the alternative hypothesis has a not equals ($\neq$) symbol, you know to
use which type of test? 


Exercise: 


Problem: 


Assume the null hypothesis states that the mean is at least 18. Is this a 
left-tailed, right-tailed, or two-tailed test? 


Solution: 


This is a left-tailed test. 
Exercise: 


Problem: 


Assume the null hypothesis states that the mean is at most 12. Is this a 
left-tailed, right-tailed, or two-tailed test? 


Exercise: 


Problem: 

Assume the null hypothesis states that the mean is equal to 88. The 
alternative hypothesis states that the mean is not equal to 88. Is this a 
left-tailed, right-tailed, or two-tailed test? 


Solution: 


This is a two-tailed test. 


Homework 


Exercise: 


Problem: 


A particular brand of tires claims that its deluxe tire averages at least 
50,000 miles before it needs to be replaced. From past studies of this 
tire, the standard deviation is known to be 8,000. A survey of owners 
of that tire design is conducted. From the 28 tires surveyed, the mean 
lifespan was 46,500 miles with a standard deviation of 9,800 miles. 
Using alpha = 0.05, is the data highly inconsistent with the claim? 


Solution: 


a. $H_0: \mu = 50,000$

b. $H_a: \mu < 50,000$

c. Let $\bar{X}$ = the average lifespan of a brand of tires.
d. normal distribution

e. z = -2.315

f. p-value = 0.0103 

g. Check student’s solution. 


h. i. alpha: 0.05 
ii. Decision: Reject the null hypothesis. 
iii. Reason for decision: The p-value is less than 0.05. 
iv. Conclusion: There is sufficient evidence to conclude that the 
mean lifespan of the tires is less than 50,000 miles. 


i. (43,537, 49,463) 


Exercise: 


Problem: 


From generation to generation, the mean age when smokers first start 
to smoke varies. However, the standard deviation of that age remains 
constant at around 2.1 years. A survey of 40 smokers of this
generation was done to see if the mean starting age is at least 19. The 
sample mean was 18.1 with a sample standard deviation of 1.3. Do the 
data support the claim at the 5% level? 


Exercise: 


Problem: 


The cost of a daily newspaper varies from city to city. However, the 
variation among prices remains steady with a standard deviation of 
20¢. A study was done to test the claim that the mean cost of a daily 
newspaper is $1.00. Twelve costs yield a mean cost of 95¢ with a 
standard deviation of 18¢. Do the data support the claim at the 1% 
level? 


Solution: 
a. $H_0: \mu = \$1.00$
b. $H_a: \mu \neq \$1.00$


c. Let $\bar{X}$ = the average cost of a daily newspaper.
d. normal distribution

e. z = -0.866

f. p-value = 0.3865 

g. Check student’s solution. 


h. i. Alpha: 0.01 
ii. Decision: Do not reject the null hypothesis. 
iii. Reason for decision: The p-value is greater than 0.01. 
iv. Conclusion: There is insufficient evidence to reject the claim
that the mean cost of daily papers is $1. The mean cost could
be $1.


i. ($0.84, $1.06) 


Exercise: 


Problem: 


An article in the San Jose Mercury News stated that students in the 
California state university system take 4.5 years, on average, to finish 
their undergraduate degrees. Suppose you believe that the mean time is 
longer. You conduct a survey of 49 students and obtain a sample mean 
of 5.1 with a sample standard deviation of 1.2. Do the data support 
your claim at the 1% level? 


Exercise: 


Problem: 


The mean number of sick days an employee takes per year is believed 
to be about ten. Members of a personnel department do not believe this 
figure. They randomly survey eight employees. The number of sick 
days they took for the past year are as follows: 12; 4; 15; 3; 11; 8; 6; 8. 
Let x = the number of sick days they took for the past year. Should the 
personnel team believe that the mean number is ten? 


Solution: 
a. $H_0: \mu = 10$
b. $H_a: \mu \neq 10$


c. Let $\bar{X}$ = the mean number of sick days an employee takes per year.
d. Student’s t-distribution

e. t = -1.12

f. p-value = 0.300 

g. Check student’s solution. 


h. i. Alpha: 0.05 
ii. Decision: Do not reject the null hypothesis. 
iii. Reason for decision: The p-value is greater than 0.05. 
iv. Conclusion: At the 5% significance level, there is 
insufficient evidence to conclude that the mean number of 
sick days is not ten. 


i. (4.9443, 11.806) 
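
The test statistic and confidence interval in this solution can be reproduced from the eight data values. This is a minimal sketch; the two-tailed critical value for 7 degrees of freedom at the 5% level, 2.365, is taken from a standard Student's t table rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

sick_days = [12, 4, 15, 3, 11, 8, 6, 8]
mu0 = 10                         # hypothesized mean under H0

n = len(sick_days)
x_bar = mean(sick_days)
s = stdev(sick_days)             # sample standard deviation (n - 1 divisor)
se = s / sqrt(n)                 # estimated standard error of the mean
t_stat = (x_bar - mu0) / se

t_crit = 2.365                   # t table: two-tailed alpha = 0.05, df = 7
reject_h0 = abs(t_stat) > t_crit
ci = (x_bar - t_crit * se, x_bar + t_crit * se)   # 95% confidence interval

print(f"t = {t_stat:.2f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print("reject H0" if reject_h0 else "do not reject H0")
```

The hypothesized mean of ten lies inside the interval, which agrees with the decision not to reject the null hypothesis.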


Exercise: 


Problem: 


In 1955, Life Magazine reported that the 25 year-old mother of three 
worked, on average, an 80 hour week. Recently, many groups have 
been studying whether or not the women's movement has, in fact, 
resulted in an increase in the average work week for women 
(combining employment and at-home work). Suppose a study was 
done to determine if the mean work week has increased. 81 women 
were surveyed with the following results. The sample mean was 83; 
the sample standard deviation was ten. Does it appear that the mean 
work week has increased for women at the 5% level? 


Exercise: 


Problem: 


Your statistics instructor claims that 60 percent of the students who 
take her Elementary Statistics class go through life feeling more 
enriched. For some reason that she can't quite figure out, most people 
don't believe her. You decide to check this out on your own. You 
randomly survey 64 of her past Elementary Statistics students and find 
that 34 feel more enriched as a result of her class. Now, what do you 
think? 


Solution: 


a. $H_0: p = 0.6$

b. $H_a: p < 0.6$

c. Let P' = the proportion of students who feel more enriched as a
result of taking Elementary Statistics.

d. normal for a single proportion

e. -1.12

f. p-value = 0.1308 

g. Check student’s solution. 


h. i. Alpha: 0.05 
ii. Decision: Do not reject the null hypothesis. 


iii. Reason for decision: The p-value is greater than 0.05. 
iv. Conclusion: There is insufficient evidence to conclude that 
less than 60 percent of her students feel more enriched. 


i. Confidence Interval: (0.409, 0.654) 
The “plus-4s” confidence interval is (0.411, 0.648) 


Exercise: 


Problem: 


A Nissan Motor Corporation advertisement read, “The average man’s 
I.Q. is 107. The average brown trout’s I.Q. is 4. So why can’t man
catch brown trout?” Suppose you believe that the brown trout’s mean 
I.Q. is greater than four. You catch 12 brown trout. A fish psychologist 
determines the I.Q.s as follows: 5; 4; 7; 3; 6; 4; 5; 3; 6; 3; 8; 5. 
Conduct a hypothesis test of your belief. 


Exercise: 
Problem: 
Refer to Exercise 9.119. Conduct a hypothesis test to see if your
decision and conclusion would change if your belief were that the
brown trout’s mean I.Q. is not four.


Solution: 
a. $H_0: \mu = 4$
b. $H_a: \mu \neq 4$


c. Let $\bar{X}$ = the average I.Q. of a set of brown trout.
d. two-tailed Student's t-test

e. t = 1.95

f. p-value = 0.076 

g. Check student’s solution. 


h. i. Alpha: 0.05 
ii. Decision: Do not reject the null hypothesis.
iii. Reason for decision: The p-value is greater than 0.05 


iv. Conclusion: There is insufficient evidence to conclude that 
the average IQ of brown trout is not four. 


i. (3.8865,5.9468) 


Exercise: 


Problem: 


According to an article in Newsweek, the natural ratio of girls to boys 
is 100:105. In China, the birth ratio is 100:114 (46.7% girls). Suppose
you don’t believe the reported figures of the percent of girls born in 
China. You conduct a study. In this study, you count the number of 
girls and boys born in 150 randomly chosen recent births. There are 60 
girls and 90 boys born of the 150. Based on your study, do you believe 
that the percent of girls born in China is 46.7? 


Exercise: 


Problem: 


A poll done for Newsweek found that 13% of Americans have seen or 
sensed the presence of an angel. A contingent doubts that the percent is 
really that high. It conducts its own survey. Out of 76 Americans 
surveyed, only two had seen or sensed the presence of an angel. As a 
result of the contingent’s survey, would you agree with the Newsweek 
poll? In complete sentences, also give three reasons why the two polls 
might give different results. 


Solution: 


a. $H_0: p = 0.13$

b. $H_a: p < 0.13$

c. Let P' = the proportion of Americans who have seen or sensed
angels 

d. normal for a single proportion 

e. —2.688 

f. p-value = 0.0036 

g. Check student’s solution. 


h. i. alpha: 0.05 
ii. Decision: Reject the null hypothesis. 
iii. Reason for decision: The p-value is less than 0.05. 
iv. Conclusion: There is sufficient evidence to conclude that the 
percentage of Americans who have seen or sensed an angel 
is less than 13%. 


i. Confidence Interval: (0, 0.0623)
The “plus-4s” confidence interval is (0.0022, 0.0978)


Exercise: 


Problem: 


The mean work week for engineers in a start-up company is believed 
to be about 60 hours. A newly hired engineer hopes that it’s shorter. 
She asks ten engineering friends in start-ups for the lengths of their 
mean work weeks. Based on the results that follow, should she count 
on the mean work week to be shorter than 60 hours? 


Data (length of mean work week): 70; 45; 55; 60; 65; 55; 55; 60; 50;
55.


Exercise: 


Problem: 


Sixty-eight percent of online courses taught at community colleges 
nationwide were taught by full-time faculty. To test if 68% also 
represents California’s percent for full-time faculty teaching the online 
classes, Long Beach City College (LBCC) in California, was randomly 
selected for comparison. In the same year, 34 of the 44 online courses 
LBCC offered were taught by full-time faculty. Conduct a hypothesis 
test to determine if 68% represents California. NOTE: For more 
accurate results, use more California community colleges and this past 
year's data. 


Exercise: 


Problem: 


According to an article in Bloomberg Businessweek, New York City's 
most recent adult smoking rate is 14%. Suppose that a survey is 
conducted to determine this year’s rate. Nine out of 70 randomly 
chosen N.Y. City residents reply that they smoke. Conduct a 
hypothesis test to determine if the rate is still 14% or if it has 
decreased. 


Solution: 
a. $H_0: p = 0.14$
b. $H_a: p < 0.14$


c. Let P'= the proportion of NYC residents that smoke. 
d. normal for a single proportion 

e. —0.2756 

f. p-value = 0.3914 

g. Check student’s solution. 


h. i. alpha: 0.05 
ii. Decision: Do not reject the null hypothesis. 
iii. Reason for decision: The p-value is greater than 0.05. 
iv. Conclusion: At the 5% significance level, there is
insufficient evidence to conclude that the proportion of
NYC residents who smoke is less than 0.14.


i. Confidence Interval: (0.0502, 0.2070). The “plus-4s” confidence
interval (see chapter 8) is (0.0676, 0.2297).


Exercise: 


Problem: 


The mean age of De Anza College students in a previous term was 
26.6 years old. An instructor thinks the mean age for online students is 
older than 26.6. She randomly surveys 56 online students and finds 
that the sample mean is 29.4 with a standard deviation of 2.1. Conduct 
a hypothesis test. 


Exercise: 


Problem: 


Registered nurses earned an average annual salary of $69,110. For that 
same year, a survey was conducted of 41 California registered nurses 
to determine if the annual salary is higher than $69,110 for California 
nurses. The sample average was $71,121 with a sample standard 
deviation of $7,489. Conduct a hypothesis test. 


Solution: 


a. $H_0: \mu = 69,110$

b. $H_a: \mu > 69,110$

c. Let $\bar{X}$ = the mean salary in dollars for California registered
nurses.

d. Student's t-distribution

e. t= 1.719 

f. p-value: 0.0466 

g. Check student’s solution. 


h. i. Alpha: 0.05 
ii. Decision: Reject the null hypothesis. 
iii. Reason for decision: The p-value is less than 0.05. 
iv. Conclusion: At the 5% significance level, there is sufficient 
evidence to conclude that the mean salary of California 
registered nurses exceeds $69,110. 


i. ($68,757, $73,485) 


Exercise: 


Problem: 


La Leche League International reports that the mean age of weaning a 
child from breastfeeding is age four to five worldwide. In America, 
most nursing mothers wean their children much earlier. Suppose a 
random survey is conducted of 21 U.S. mothers who recently weaned 
their children. The mean weaning age was nine months (3/4 year) with 
a standard deviation of 4 months. Conduct a hypothesis test to 
determine if the mean weaning age in the U.S. is less than four years 
old. 


Exercise: 


Problem: 


Over the past few decades, public health officials have examined the 
link between weight concerns and teen girls' smoking. Researchers 
surveyed a group of 273 randomly selected teen girls living in 
Massachusetts (between 12 and 15 years old). After four years the girls 
were surveyed again. Sixty-three said they smoked to stay thin. Is there 
good evidence that more than thirty percent of the teen girls smoke to 
stay thin? 

After conducting the test, your decision and conclusion are 


a. Reject $H_0$: There is sufficient evidence to conclude that more than
30% of teen girls smoke to stay thin.

b. Do not reject $H_0$: There is not sufficient evidence to conclude that
less than 30% of teen girls smoke to stay thin.

c. Do not reject $H_0$: There is not sufficient evidence to conclude that
more than 30% of teen girls smoke to stay thin.

d. Reject $H_0$: There is sufficient evidence to conclude that less than
30% of teen girls smoke to stay thin.


Solution: 


C 


Exercise: 


Problem: 


A Statistics instructor believes that fewer than 20% of Evergreen 
Valley College (EVC) students attended the opening night midnight 
showing of the latest Harry Potter movie. She surveys 84 of her 
students and finds that 11 of them attended the midnight showing. 
At a 1% level of significance, an appropriate conclusion is: 


a. There is insufficient evidence to conclude that the percent of EVC 
students who attended the midnight showing of Harry Potter is 
less than 20%. 

b. There is sufficient evidence to conclude that the percent of EVC 
students who attended the midnight showing of Harry Potter is 
more than 20%. 

c. There is sufficient evidence to conclude that the percent of EVC 
students who attended the midnight showing of Harry Potter is 
less than 20%. 

d. There is insufficient evidence to conclude that the percent of EVC 
students who attended the midnight showing of Harry Potter is at 
least 20%. 


Exercise: 


Problem: 


Previously, an organization reported that teenagers spent 4.5 hours per 
week, on average, on the phone. The organization thinks that, 
currently, the mean is higher. Fifteen randomly chosen teenagers were 
asked how many hours per week they spend on the phone. The sample 
mean was 4.75 hours with a sample standard deviation of 2.0. Conduct 
a hypothesis test. 


At a significance level of $\alpha = 0.05$, what is the correct conclusion?


a. There is enough evidence to conclude that the mean number of 
hours is more than 4.75 


b. There is enough evidence to conclude that the mean number of 
hours is more than 4.5 

c. There is not enough evidence to conclude that the mean number 
of hours is more than 4.5 

d. There is not enough evidence to conclude that the mean number 
of hours is more than 4.75 


Solution: 


Instructions: For the following ten exercises on hypothesis testing, answer each question.


a. State the null and alternate hypothesis. 

b. State the p-value. 

c. State alpha. 

d. What is your decision? 

e. Write a conclusion. 

f. Answer any other questions asked in the problem. 


Exercise: 


Problem: 


According to the Center for Disease Control website, in 2011 at least 
18% of high school students have smoked a cigarette. An Introduction 
to Statistics class in Davies County, KY conducted a hypothesis test at 
the local high school (a medium sized—approximately 1,200 students— 
small city demographic) to determine if the local high school’s 
percentage was lower. One hundred fifty students were chosen at 
random and surveyed. Of the 150 students surveyed, 82 have smoked. 
Use a significance level of 0.05 and using appropriate statistical 
evidence, conduct a hypothesis test and state the conclusions. 


Exercise: 


Problem: 


A recent survey in the N.Y. Times Almanac indicated that 48.8% of 
families own stock. A broker wanted to determine if this survey could 
be valid. He surveyed a random sample of 250 families and found that 
142 owned some type of stock. At the 0.05 significance level, can the 
survey be considered to be accurate? 


Solution: 


a. $H_0: p = 0.488$; $H_a: p \neq 0.488$

b. p-value = 0.0114 

c. alpha = 0.05 

d. Reject the null hypothesis. 

e. At the 5% level of significance, there is enough evidence to
conclude that the percentage of families that own stocks is not 48.8%.

f. The survey does not appear to be accurate. 


Exercise: 


Problem: 


Driver error can be listed as the cause of approximately 54% of all 
fatal auto accidents, according to the American Automobile 
Association. Thirty randomly selected fatal accidents are examined, 
and it is determined that 14 were caused by driver error. Using
$\alpha = 0.05$, is the AAA proportion accurate?


Exercise: 
Problem: 
The US Department of Energy reported that 51.7% of homes were
heated by natural gas. A random sample of 221 homes in Kentucky
found that 115 were heated by natural gas. Does the evidence support
the claim for Kentucky at the $\alpha = 0.05$ level? Are the
results applicable across the country? Why?


Solution: 


a. $H_0: p = 0.517$; $H_a: p \neq 0.517$

b. p-value = 0.9203. 

c. alpha = 0.05. 

d. Do not reject the null hypothesis. 

e. At the 5% significance level, there is not enough evidence to
conclude that the proportion of homes in Kentucky that are heated
by natural gas is different from 0.517.

f. However, we cannot generalize this result to the entire nation. 
First, the sample’s population is only the state of Kentucky. 
Second, it is reasonable to assume that homes in the extreme 
north and south will have extreme high usage and low usage, 
respectively. We would need to expand our sample base to 
include these possibilities if we wanted to generalize this claim to 
the entire nation. 


Exercise: 


Problem: 


For Americans using library services, the American Library 
Association claims that at most 67% of patrons borrow books. The 
library director in Owensboro, Kentucky feels this is not true, so she
asked a local college statistics class to conduct a survey. The class
randomly selected 100 patrons and found that 82 borrowed books. Did
the class demonstrate that the percentage was higher in Owensboro,
KY? Use the $\alpha = 0.01$ level of significance. What is the possible
proportion of patrons that do borrow books from the Owensboro 
Library? 


Exercise: 


Problem: 


The Weather Underground reported that the mean amount of summer 
rainfall for the northeastern US is at least 11.52 inches. Ten cities in 
the northeast are randomly selected and the mean rainfall amount is 
calculated to be 7.42 inches with a standard deviation of 1.3 inches. At 
the $\alpha = 0.05$ level, can it be concluded that the mean rainfall was below
the reported average? What if $\alpha = 0.01$? Assume the amount of
summer rainfall follows a normal distribution. 


Solution: 


a. $H_0: \mu \geq 11.52$; $H_a: \mu < 11.52$

b. p-value = 0.000002 which is almost 0. 

c. alpha = 0.05. 

d. Reject the null hypothesis. 

e. At the 5% significance level, there is enough evidence to 
conclude that the mean amount of summer rain in the northeastern
US is less than 11.52 inches, on average. 

f. We would make the same conclusion if alpha was 1% because the 
p-value is almost 0. 


Exercise: 


Problem: 


A survey in the N.Y. Times Almanac finds the mean commute time 
(one way) is 25.4 minutes for the 15 largest US cities. The Austin, TX 
chamber of commerce feels that Austin’s commute time is less and 
wants to publicize this fact. The mean for 25 randomly selected 
commuters is 22.1 minutes with a standard deviation of 5.3 minutes. 
At the $\alpha = 0.10$ level, is the Austin, TX commute significantly less
than the mean commute time for the 15 largest US cities? 


Exercise: 


Problem: 


A report by the Gallup Poll found that a woman visits her doctor, on 
average, at most 5.8 times each year. A random sample of 20 women 
results in these yearly visit totals 


3; 2; 1; 3; 7; 2; 9; 4; 6; 6; 8; 0; 5; 6; 4; 2; 1; 3; 4; 1

At the $\alpha = 0.05$ level can it be concluded that the mean number of
visits is higher than 5.8 visits per year?


Solution: 


a. $H_0: \mu \leq 5.8$; $H_a: \mu > 5.8$

b. p-value = 0.9987 

c. alpha = 0.05 

d. Do not reject the null hypothesis. 

e. At the 5% level of significance, there is not enough evidence to 
conclude that a woman visits her doctor, on average, more than 
5.8 times a year. 


Exercise: 


Problem: 


According to the N.Y. Times Almanac the mean family size in the U.S. 
is 3.18. A sample of a college math class resulted in the following 
family sizes: 

5; 4; 5; 4; 4; 3; 6; 4; 3; 3; 5; 5; 6; 3; 3; 2; 7; 4; 5; 2; 2; 2; 3; 2

At the $\alpha = 0.05$ level, is the class’ mean family size greater than the
national average? Does the Almanac result remain valid? Why? 


Exercise: 


Problem: 


The student academic group on a college campus claims that freshman 
students study at least 2.5 hours per day, on average. One Introduction 
to Statistics class was skeptical. The class took a random sample of 30 
freshman students and found a mean study time of 137 minutes with a 
standard deviation of 45 minutes. At the $\alpha = 0.01$ level, is the student
academic group’s claim correct? 


Solution: 


a. $H_0: \mu = 150$; $H_a: \mu < 150$

b. p-value = 0.0622 

c. alpha = 0.01 

d. Do not reject the null hypothesis. 

e. At the 1% significance level, there is not enough evidence to 
conclude that freshmen students study less than 2.5 hours per day, 
on average. 

f. The student academic group’s claim appears to be correct. 

Glossary 


Central Limit Theorem 
Given a random variable (RV) with known mean $\mu$ and known
standard deviation $\sigma$, we sample with size n and are
interested in a new RV: the sample mean, $\bar{X}$. If the size n of the
sample is sufficiently large, then
$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$; that is, the
distribution of the sample means will approximate a normal distribution
regardless of the shape of the population. The expected value of the
mean of the sample means will equal the population mean. The standard
deviation of the distribution of the sample means,
$\frac{\sigma}{\sqrt{n}}$, is called the standard error of the mean.
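
A short simulation can make this definition concrete. The sketch below is an illustration only: the exponential population (mean 1, standard deviation 1), the sample size of 40, the 5,000 repetitions, and the random seed are all arbitrary choices and do not come from the text.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)        # fixed seed so the run is repeatable

rate = 1.0            # exponential population: mean 1, standard deviation 1
n = 40                # size of each sample
num_samples = 5000    # number of samples drawn

# Draw many samples from a skewed population and record each sample mean.
sample_means = [
    mean([random.expovariate(rate) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = mean(sample_means)   # should be close to the population mean, 1
observed_se = stdev(sample_means)    # spread of the sample means
theoretical_se = 1 / sqrt(n)         # sigma / sqrt(n), the standard error

print(f"mean of sample means = {mean_of_means:.3f}")
print(f"observed SE = {observed_se:.3f}, theoretical SE = {theoretical_se:.3f}")
```

Even though the population is strongly skewed, the sample means cluster around the population mean with a spread close to the standard error of the mean.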


Rare Events, the Sample, Decision and Conclusion 


Establishing the type of distribution, sample size, and known or unknown 
standard deviation can help you figure out how to go about a hypothesis 
test. However, there are several other factors you should consider when 
working out a hypothesis test. 


Rare Events 


Suppose you make an assumption about a property of the population (this 
assumption is the null hypothesis). Then you gather sample data randomly. 
If the sample has properties that would be very unlikely to occur if the 
assumption is true, then you would conclude that your assumption about the 
population is probably incorrect. (Remember that your assumption is just an
assumption; it is not a fact and it may or may not be true. But your sample
data are real and the data are showing you a fact that seems to contradict
your assumption.)


For example, Didi and Ali are at a birthday party of a very wealthy friend. 
They hurry to be first in line to grab a prize from a tall basket that they 
cannot see inside because they will be blindfolded. There are 200 plastic 
bubbles in the basket and Didi and Ali have been told that there is only one 
with a $100 bill. Didi is the first person to reach into the basket and pull out 
a bubble. Her bubble contains a $100 bill. The probability of this happening 
is 1/200 = 0.005. Because this is so unlikely, Ali is hoping that what the two
of them were told is wrong and there are more $100 bills in the basket. A
"rare event" has occurred (Didi getting the $100 bill) so Ali doubts the 
assumption about only one $100 bill being in the basket. 
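
The reasoning in this example can be put into numbers. The sketch below computes the chance that the first draw holds a $100 bill under the stated assumption of one bill in 200 bubbles, and compares it with a few alternative bill counts. The alternative counts (10, 50, 100) are hypothetical values chosen only for illustration.

```python
from fractions import Fraction

bubbles = 200   # plastic bubbles in the basket


def chance_first_draw_wins(bills_in_basket, total=bubbles):
    """Probability that one blind draw pulls a bubble holding a $100 bill."""
    return Fraction(bills_in_basket, total)


# Under the assumption Didi and Ali were given: only one $100 bill.
p_under_assumption = chance_first_draw_wins(1)
print(f"1 bill: {float(p_under_assumption)}")

# Hypothetical alternatives: what Didi saw is far less surprising
# if the basket actually holds more bills.
for bills in (10, 50, 100):
    print(f"{bills} bills: {float(chance_first_draw_wins(bills))}")
```

An outcome with probability 0.005 under the assumption, but much more likely under the alternatives, is the kind of evidence that leads to doubting the assumption.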


Chapter Review 


When the probability of an event occurring is low, and it happens, it is 
called a rare event. Rare events are important to consider in hypothesis 
testing because they can inform your willingness not to reject or to reject a 
null hypothesis. To test a null hypothesis, find the p-value for the sample 
data and graph the results. 


Exercise: 


Problem: When do you reject the null hypothesis? 
Exercise: 
Problem: 


The probability of winning the grand prize at a particular carnival 
game is 0.005. Is the outcome of winning very likely or very unlikely? 


Solution: 


The outcome of winning is very unlikely. 
Exercise: 


Problem: 


The probability of winning the grand prize at a particular carnival 
game is 0.005. Michele wins the grand prize. Is this considered a rare 
or common event? Why? 


Exercise: 


Problem: 


It is believed that the mean height of high school students who play 
basketball on the school team is 73 inches with a standard deviation of 
1.8 inches. A random sample of 40 players is chosen. The sample 
mean was 71 inches, and the sample standard deviation was 1.5 inches. 
Do the data support the claim that the mean height is less than 73 
inches? The p-value is almost zero. State the null and alternative 
hypotheses and interpret the p-value. 


Solution: 


$H_0: \mu \geq 73$

$H_a: \mu < 73$

The p-value is almost zero, which means there is sufficient data to
conclude that the mean height of high school students who play
basketball on the school team is less than 73 inches at the 5% level.
The data do support the claim.


Exercise: 


Problem: 


The mean age of graduate students at a University is at most 31 years
with a standard deviation of two years. A random sample of 15 
graduate students is taken. The sample mean is 32 years and the 
sample standard deviation is three years. Are the data significant at the 
1% level? The p-value is 0.0264. State the null and alternative 
hypotheses and interpret the p-value. 


Exercise: 
Problem: 


Does the shaded region represent a low or a high p-value compared to 
a level of significance of 1%? 


[Figure: distribution curve with a very small shaded tail region, labeled "p-value is approximately 0".]


Solution: 


The shaded region shows a low p-value. 


Exercise: 


Problem: What should you do when $\alpha$ > p-value?


Exercise: 


Problem: What should you do if $\alpha$ = p-value?


Solution: 


Do not reject $H_0$.
Exercise: 
Problem: 


If you do not reject the null hypothesis, then it must be true. Is this 
statement correct? State why or why not in complete sentences. 


Use the following information to answer the next seven exercises: Suppose 
that a recent article stated that the mean time spent in jail by a first-time 
convicted burglar is 2.5 years. A study was then done to see if the mean 
time has increased in the new century. A random sample of 26 first-time 
convicted burglars in a recent year was picked. The mean length of time in 
jail from the survey was three years with a standard deviation of 1.8 years. 
Suppose that it is somehow known that the population standard deviation is 
1.5. Conduct a hypothesis test to determine if the mean length of jail time 
has increased. Assume the distribution of the jail times is approximately 
normal. 

Exercise: 


Problem: Is this a test of means or proportions? 


Solution: 


means 


Exercise: 


Problem: What symbol represents the random variable for this test? 


Exercise: 


Problem: In words, define the random variable for this test. 


Solution: 


the mean time spent in jail for 26 first time convicted burglars 
Exercise: 


Problem: 


Is the population standard deviation known and, if so, what is it? 


Exercise: 


Problem: Calculate the following: 


a. $\bar{x}$ = ________
b. $\sigma$ = ________
c. $s_x$ = ________
d. n = ________


Solution: 


a. $\bar{x}$ = 3
b. $\sigma$ = 1.5
c. $s_x$ = 1.8
d. n = 26


Exercise: 


Problem: 


Since both $\sigma$ and $s_x$ are given, which should be used? In one to two
complete sentences, explain why. 


Exercise: 
Problem: State the distribution to use for the hypothesis test. 
Solution: 


$\bar{X} \sim N\left(2.5, \frac{1.5}{\sqrt{26}}\right)$


Exercise: 


Problem: 


A random survey of 75 death row inmates revealed that the mean 
length of time on death row is 17.4 years with a standard deviation of 
6.3 years. Conduct a hypothesis test to determine if the population 
mean time on death row could likely be 15 years. 


a. Is this a test of one mean or proportion? 
b. State the null and alternative hypotheses. 
$H_0$: __________ $H_a$: __________
c. Is this a right-tailed, left-tailed, or two-tailed test? 
d. What symbol represents the random variable for this test? 
e. In words, define the random variable for this test. 
f. Is the population standard deviation known and, if so, what is it? 
g. Calculate the following: 


i. $\bar{x}$ =
ii. s =
iii. n =
h. Which test should be used? 
i. State the distribution to use for the hypothesis test. 
j. Find the p-value. 
k. At a pre-conceived $\alpha$ = 0.05, what is your:
i. Decision: 
ii. Reason for the decision: 
iii. Conclusion (write out in a complete sentence): 
Homework 


Exercise: 


Problem: 


The National Institute of Mental Health published an article stating 
that in any one-year period, approximately 9.5 percent of American 
adults suffer from depression or a depressive illness. Suppose that in a 
survey of 100 people in a certain town, seven of them suffered from 
depression or a depressive illness. Conduct a hypothesis test to 
determine if the true proportion of people in that town suffering from 
depression or a depressive illness is lower than the percent in the 
general adult American population. 


a. Is this a test of one mean or proportion? 
b. State the null and alternative hypotheses. 
$H_0$: __________ $H_a$: __________
c. Is this a right-tailed, left-tailed, or two-tailed test? 
d. What symbol represents the random variable for this test? 
e. In words, define the random variable for this test. 
f. Calculate the following: 


i. x =
ii. n =
iii. p′ =
g. Calculate $\sigma_x$ = __________. Show the formula set-up.
h. State the distribution to use for the hypothesis test. 


i. Find the p-value. 
j. At a pre-conceived $\alpha$ = 0.05, what is your:


i. Decision: 


ii. Reason for the decision: 
iii. Conclusion (write out in a complete sentence): 


Glossary 


Level of Significance of the Test 


probability of a Type I error (reject the null hypothesis when it is true).
Notation: $\alpha$. In hypothesis testing, the Level of Significance is called
the preconceived $\alpha$ or the preset $\alpha$. The Confidence level is $(1 - \alpha)$.


Introduction 


If you want to test a claim that involves two groups (the types of
breakfasts eaten east and west of the Mississippi River) you can use a
slightly different technique when conducting a hypothesis test. (credit:
Chloe Lim)


Studies often compare two groups. For example, researchers are interested 
in the effect aspirin has in preventing heart attacks. Over the last few years, 
newspapers and magazines have reported various aspirin studies involving 
two groups. Typically, one group is given aspirin and the other group is 
given a placebo. Then, the heart attack rate is studied over several years. 


There are other situations that deal with the comparison of two groups. For 
example, studies compare various diet and exercise programs. Politicians 
compare the proportion of individuals from different income brackets who 
might vote for them. Students are interested in whether SAT or GRE 
preparatory courses really help raise their scores. Many business 
applications require comparing two groups. It may be the investment 
returns of two different investment strategies, or the differences in 
production efficiency of different management styles. 


To compare two means or two proportions, you work with two groups. The 
groups are classified either as independent or matched pairs. 
Independent groups consist of two samples that are independent, that is, 
sample values selected from one population are not related in any way to
sample values selected from the other population. Matched pairs consist of
two samples that are dependent. The parameter tested using matched pairs 
is the population mean. The parameters tested using independent groups are 
either population means or population proportions of each group. 


Glossary 


Independent Groups 
two samples that are selected from two populations, and the values 
from one population are not related in any way to the values from the 
other population. 


Matched Pairs 
two samples that are dependent. Differences between a before and 
after scenario are tested by testing one population mean of differences. 


Comparing Two Independent Population Means 


The comparison of two independent population means is very common and 
provides a way to test the hypothesis that the two groups differ from each 
other. Is the night shift less productive than the day shift, are the rates of 
return from fixed asset investments different from those from common 
stock investments, and so on? An observed difference between two sample 
means depends on both the means and the sample standard deviations. Very 
different means can occur by chance if there is great variation among the 
individual samples. The test statistic will have to account for this fact. The 
test comparing two independent population means with unknown and
possibly unequal population standard deviations is called the Aspin-Welch
t-test. The degrees of freedom formula we will see later was developed by
Aspin and Welch.


When we developed the hypothesis test for the mean and proportions we 
began with the Central Limit Theorem. We recognized that a sample mean 
came from a distribution of sample means, and sample proportions came 
from the sampling distribution of sample proportions. This made our
sample statistics, the sample means and sample proportions, into random
variables. It was important for us to know the distribution that these random
variables came from. The Central Limit Theorem gave us the answer: the
normal distribution. Our Z and t statistics came from this theorem. This
provided us with the solution to our question of how to measure the 
probability that a sample mean came from a distribution with a particular 
hypothesized value of the mean or proportion. In both cases that was the 
question: what is the probability that the mean (or proportion) from our 
sample data came from a population distribution with the hypothesized 
value we are interested in? 


Now we are interested in whether or not two samples have the same mean.
Our question has not changed: Do these two samples come from the same
population distribution? We recognize that we have two sample means, one
from each set of data, and thus we have two random variables coming from
two unknown distributions. To solve the problem we create a new random
variable, the difference between the sample means. This new random variable
also has a distribution and, again, the Central Limit Theorem tells us that,
for sufficiently large samples, this new distribution is approximately
normal, regardless of the underlying distributions of the original data. A
graph may help to understand this concept.
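[Figure: top panel, the distributions of Population 1 and Population 2; middle panel, the sampling distribution of the difference in sample means; bottom panel, the standard normal (Z) distribution, with $H_0: \mu_1 - \mu_2 = \delta_0$ and $H_a: \mu_1 - \mu_2 \neq \delta_0$.]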


Pictured are two distributions of data, $X_1$ and $X_2$, with unknown means and
standard deviations. The second panel shows the sampling distribution of
the newly created random variable $(\bar{X}_1 - \bar{X}_2)$. This distribution is the
theoretical distribution of many sample means from population 1
minus sample means from population 2. The Central Limit Theorem tells us
that this theoretical sampling distribution of differences in sample means is
normally distributed, regardless of the distribution of the actual population
data shown in the top panel. Because the sampling distribution is normally
distributed, we can develop a standardizing formula and calculate
probabilities from the standard normal distribution in the bottom panel, the
Z distribution. We have seen this same analysis before in Chapter 7, Figure
7.2.
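The behaviour of the difference in sample means can be checked with a small simulation. The following is a minimal sketch and not part of the original text: the two populations (an exponential distribution with mean 5 and a uniform distribution on [0, 6]), the sample sizes, and the function name `simulate_mean_differences` are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate_mean_differences(draw1, draw2, n1, n2, reps, seed=0):
    """Return `reps` simulated values of (sample mean 1 - sample mean 2)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    diffs = []
    for _ in range(reps):
        mean1 = statistics.fmean(draw1(rng) for _ in range(n1))
        mean2 = statistics.fmean(draw2(rng) for _ in range(n2))
        diffs.append(mean1 - mean2)
    return diffs


# Hypothetical, clearly non-normal populations:
#   population 1: exponential with mean 5 (variance 25)
#   population 2: uniform on [0, 6], mean 3 (variance 3)
diffs = simulate_mean_differences(
    draw1=lambda rng: rng.expovariate(1 / 5),
    draw2=lambda rng: rng.uniform(0, 6),
    n1=40,
    n2=50,
    reps=5000,
)

# The centre should be close to mu1 - mu2 = 2, and the spread close to
# sqrt(25/40 + 3/50), the standard deviation of the difference.
print(round(statistics.fmean(diffs), 2))
print(round(statistics.stdev(diffs), 2))
```

A histogram of `diffs` would look roughly bell-shaped even though neither population is normal, which is the point the figure makes.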


The Central Limit Theorem, as before, provides us with the standard
deviation of the sampling distribution and, further, tells us that the
expected value of the distribution of differences in sample means is equal to
the difference in the population means. Mathematically this can be stated:
Equation:

$$E(\bar{X}_1 - \bar{X}_2) = \mu_1 - \mu_2$$

Because we do not know the population standard deviations, we estimate
them using the two sample standard deviations from our independent
samples. For the hypothesis test, we calculate the estimated standard
deviation, or standard error, of the difference in sample means, $\bar{X}_1 - \bar{X}_2$.
Equation:

The standard error is:

$$\sqrt{\frac{(s_1)^2}{n_1} + \frac{(s_2)^2}{n_2}}$$

We remember that substituting the sample variance for the population
variance when we did not have the population variance was the technique
we used when building the confidence interval and the test statistic for the
test of hypothesis for a single mean back in Confidence Intervals. The test
statistic (t-score) is calculated as follows:

Equation:

$$t_c = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_0}{\sqrt{\frac{(s_1)^2}{n_1} + \frac{(s_2)^2}{n_2}}}$$

where: 


• $s_1$ and $s_2$, the sample standard deviations, are estimates of $\sigma_1$ and $\sigma_2$,
respectively, and

• $\sigma_1$ and $\sigma_2$ are the unknown population standard deviations.

• $\bar{x}_1$ and $\bar{x}_2$ are the sample means. $\mu_1$ and $\mu_2$ are the unknown population
means.
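The standard error and test statistic defined above can be sketched in a few lines of code. This is an illustrative sketch only: the function names `standard_error` and `t_statistic` and the summary statistics passed to them are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
import math


def standard_error(s1, n1, s2, n2):
    """Estimated standard error of the difference in sample means."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def t_statistic(xbar1, xbar2, s1, n1, s2, n2, delta0=0.0):
    """Test statistic for H0: mu1 - mu2 = delta0 (variances not pooled)."""
    return ((xbar1 - xbar2) - delta0) / standard_error(s1, n1, s2, n2)


# Hypothetical summary statistics (not from the text).
se = standard_error(s1=3.0, n1=9, s2=4.0, n2=16)   # sqrt(9/9 + 16/16)
t = t_statistic(xbar1=10.0, xbar2=8.0, s1=3.0, n1=9, s2=4.0, n2=16)
print(round(se, 4), round(t, 4))
```

Keeping the standard error in its own function mirrors the way the text introduces it before the test statistic.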


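The number of degrees of freedom (df) requires a somewhat complicated
calculation. The df are not always a whole number. The test statistic above
is approximated by the Student's t-distribution with df as follows:
Equation:

Degrees of freedom

$$df = \frac{\left(\frac{(s_1)^2}{n_1} + \frac{(s_2)^2}{n_2}\right)^2}{\left(\frac{1}{n_1 - 1}\right)\left(\frac{(s_1)^2}{n_1}\right)^2 + \left(\frac{1}{n_2 - 1}\right)\left(\frac{(s_2)^2}{n_2}\right)^2}$$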

When both sample sizes $n_1$ and $n_2$ are 30 or larger, the Student's t
approximation is very good. If each sample has more than 30 observations
then the degrees of freedom can be calculated as $n_1 + n_2 - 2$.


The format of the sampling distribution, differences in sample means,
specifies that the format of the null and alternative hypotheses is:
Equation:

$$H_0: \mu_1 - \mu_2 = \delta_0$$

Equation:

$$H_a: \mu_1 - \mu_2 \neq \delta_0$$

where $\delta_0$ is the hypothesized difference between the two means. If the
question is simply “is there any difference between the means?” then $\delta_0 = 0$
and the null and alternative hypotheses become:
Equation:

$$H_0: \mu_1 = \mu_2$$

Equation:

$$H_a: \mu_1 \neq \mu_2$$


An example of when $\delta_0$ might not be zero is when the comparison of the
two groups requires a specific difference for the decision to be meaningful.
Imagine that you are making a capital investment. You are considering
changing from your current model machine to another. You measure the
productivity of your machines by the speed at which they produce the product.
It may be that a contender to replace the old model is faster in terms of
product throughput, but is also more expensive. The second machine may
also have higher maintenance costs, setup costs, etc. The null hypothesis
would be set up so that the new machine would have to be better than the
old one by enough to cover these extra costs in terms of speed and cost of
production. This form of the null and alternative hypotheses shows how
valuable this particular hypothesis test can be. For most of our work we will
be testing simple hypotheses asking if there is any difference between the
two distribution means.
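To see how a nonzero hypothesized difference changes the test statistic, consider the following sketch. The throughput figures, the required margin of 5 units per hour, and all names in the code are hypothetical; they are not taken from the text.

```python
import math


def t_statistic(xbar_new, xbar_old, s_new, n_new, s_old, n_old, delta0):
    """Test statistic for H0: mu_new - mu_old = delta0 (variances not pooled)."""
    se = math.sqrt(s_new ** 2 / n_new + s_old ** 2 / n_old)
    return ((xbar_new - xbar_old) - delta0) / se


# Hypothetical throughput data (units per hour) for the two machines.
new = dict(xbar_new=112.0, s_new=10.0, n_new=25)
old = dict(xbar_old=100.0, s_old=10.0, n_old=25)

# delta0 = 0 asks "is the new machine different at all?"
t_any_difference = t_statistic(**new, **old, delta0=0.0)
# delta0 = 5 asks "does it differ from the old one by the 5 units needed
# to cover its extra costs?"
t_must_beat_by_5 = t_statistic(**new, **old, delta0=5.0)
print(round(t_any_difference, 2), round(t_must_beat_by_5, 2))
```

The same sample data give a smaller test statistic once the required margin is built into the null hypothesis, so the evidence has to be stronger before the new machine is preferred.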


Example: 

Independent groups 

The Kona Iki Corporation produces coconut milk. They take coconuts and 
extract the milk inside by drilling a hole and pouring the milk into a vat for 
processing. They have both a day shift (called the B shift) and a night shift 
(called the G shift) to do this part of the process. They would like to know 
if the day shift and the night shift are equally efficient in processing the 
coconuts. A study is done sampling 9 shifts of the G shift and 16 shifts of
the B shift. The number of hours required to process 100 pounds of
coconuts is presented in [link].


         Sample   Average Number of Hours to       Sample Standard
         Size     Process 100 Pounds of Coconuts   Deviation
G Shift  9        2                                0.866
B Shift  16       3.2                              1.00
Exercise: 
Problem: 


Is there a difference in the mean amount of time for each shift to 
process 100 pounds of coconuts? Test at the 5% level of significance. 


Solution: 


The population standard deviations are not known and cannot be
assumed to equal each other. Let g be the subscript for the G Shift
and b be the subscript for the B Shift. Then, $\mu_g$ is the population mean
for G Shift and $\mu_b$ is the population mean for B Shift. This is a test of
two independent groups, two population means.


Random variable: $\bar{X}_g - \bar{X}_b$ = difference in the sample mean
amount of time that the G Shift and the B Shift take to process
the coconuts.

$H_0: \mu_g = \mu_b$, that is, $\mu_g - \mu_b = 0$

$H_a: \mu_g \neq \mu_b$, that is, $\mu_g - \mu_b \neq 0$

The words "the same" tell you $H_0$ has an "=". Since no other words
indicate that either shift is faster or slower, this is a two-tailed
test.


Distribution for the test: Use $t_{df}$ where df is calculated using the df
formula for independent groups, two population means above. Using
a calculator, df is approximately 18.8462.
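The degrees of freedom quoted here can be reproduced from the sample standard deviations and sample sizes given for the two shifts. The sketch below is one way to do the calculation; the function name `welch_df` is an assumption made for illustration.

```python
def welch_df(s1, n1, s2, n2):
    """Degrees of freedom for two independent means, variances not pooled."""
    v1 = s1 ** 2 / n1
    v2 = s2 ** 2 / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


# G Shift: s = 0.866, n = 9; B Shift: s = 1.00, n = 16 (from the example).
df = welch_df(s1=0.866, n1=9, s2=1.00, n2=16)
print(round(df, 2))  # about 18.85, in line with the 18.8462 quoted above
```

The small difference in the later decimal places comes from rounding the G Shift standard deviation to 0.866.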


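[Figure: top panel, the sampling distribution of the differences in sample means; bottom panel, the t-distribution with the test statistic marked. $H_0: \mu_g - \mu_b = 0$, or equivalently $\mu_g = \mu_b$; $H_a: \mu_g - \mu_b \neq 0$, or equivalently $\mu_g \neq \mu_b$.]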
Equation:

$$t_c = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_0}{\sqrt{\frac{(s_1)^2}{n_1} + \frac{(s_2)^2}{n_2}}} = -3.01$$


We next find the critical value on the t-table using the degrees of
freedom from above. The critical value, 2.093, is found in the .025
column (this is $\alpha/2$) at 19 degrees of freedom. (The convention is to
round up the degrees of freedom to make the conclusion more
conservative.) Next we calculate the test statistic and mark this on the
t-distribution graph.


Make a decision: Since the calculated t-value is in the tail we cannot
accept the null hypothesis that there is no difference between the two
groups. The means are different.
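The decision rule used here, comparing the test statistic with the critical value for the chosen tail, can be written as a short function. This is a sketch under the assumption that the critical value has already been read from a t-table; the function name `in_rejection_region` is hypothetical.

```python
def in_rejection_region(t_stat, critical_value, tail="two"):
    """Return True when the test statistic falls in the rejection region.

    `critical_value` is the positive table value for the chosen tail.
    """
    if tail == "two":
        return abs(t_stat) > critical_value
    if tail == "right":
        return t_stat > critical_value
    if tail == "left":
        return t_stat < -critical_value
    raise ValueError("tail must be 'two', 'right', or 'left'")


# Values quoted in this example: t = -3.01, critical value 2.093.
reject = in_rejection_region(t_stat=-3.01, critical_value=2.093, tail="two")
print(reject)  # True: the statistic lies in the left tail
```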


The graph has included the sampling distribution of the differences in 
the sample means to show how the t-distribution aligns with the 
sampling distribution data. We see in the top panel that the calculated 
difference in the two means is -1.2 and the bottom panel shows that 
this is 3.01 standard deviations from the mean. Typically we do not 
need to show the sampling distribution graph and can rely on the 
graph of the test statistic, the t-distribution in this case, to reach our 
conclusion. 


Conclusion: At the 5% level of significance, the sample data show 
there is sufficient evidence to conclude that the mean number of hours 
that the G Shift takes to process 100 pounds of coconuts is different 
from the B Shift (mean number of hours for the B Shift is greater than 
the mean number of hours for the G Shift). 


Note:

When the sum of the sample sizes is larger than 30 ($n_1 + n_2 > 30$) you can
use the normal distribution to approximate the Student's t.


Example: 

A study is done to determine if Company A retains its workers longer than 
Company B. It is believed that Company A has a higher retention than 
Company B. The study finds that in a sample of 11 workers at Company A 
their average time with the company is four years with a standard deviation 
of 1.5 years. A sample of 9 workers at Company B finds that the average 
time with the company was 3.5 years with a standard deviation of 1 year. 
Test this proposition at the 1% level of significance. 

Exercise: 


Problem: a. Is this a test of two means or two proportions? 


Solution: 


a. two means because time is a continuous random variable. 


Exercise: 


Problem: 


b. Are the populations standard deviations known or unknown? 
Solution: 


b. unknown 


Exercise: 


Problem: c. Which distribution do you use to perform the test? 


Solution: 


c. Student's t 


Exercise: 


Problem: d. What is the random variable? 


Solution: 
d. $\bar{X}_A - \bar{X}_B$
Exercise: 


Problem: e. What are the null and alternate hypotheses? 


Solution:


e. $H_0: \mu_A \le \mu_B$
$H_a: \mu_A > \mu_B$


Exercise: 


Problem: f. Is this test right-, left-, or two-tailed? 


Solution: 


f. right one-tailed test 


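[Figure: t-distribution with $\alpha = 0.01$ in the right tail; the test statistic 0.89 and the critical value 2.764 are marked. $H_0: \mu_A \le \mu_B$; $H_a: \mu_A > \mu_B$.]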
Exercise: 


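Problem: g. What is the value of the test statistic?


Solution:


g. $t_c = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_0}{\sqrt{\frac{(s_1)^2}{n_1} + \frac{(s_2)^2}{n_2}}} = \frac{(4 - 3.5) - 0}{\sqrt{\frac{(1.5)^2}{11} + \frac{(1)^2}{9}}} = 0.89$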


Exercise: 


Problem: h. Can you accept/reject the null hypothesis?
Solution: 


h. Cannot reject the null hypothesis that there is no difference between 
the two groups. Test statistic is not in the tail. The critical value of the 
t distribution is 2.764 with 10 degrees of freedom. This example 
shows how difficult it is to reject a null hypothesis with a very small 
sample. The critical values require very large test statistics to reach 
the tail. 


Exercise: 


Problem: i. Conclusion:
Solution:
i. At the 1% level of significance, from the sample data, there is not
sufficient evidence to conclude that the retention of workers at
Company A is longer than Company B, on average.
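The arithmetic behind this example can be reproduced directly from the summary statistics in the problem statement. The variable names below are assumptions made for illustration; the critical value 2.764 is the one quoted in part h.

```python
import math

# Summary statistics quoted in the example.
xbar_a, s_a, n_a = 4.0, 1.5, 11    # Company A
xbar_b, s_b, n_b = 3.5, 1.0, 9     # Company B
critical_value = 2.764             # right-tail critical value from part h

se = math.sqrt(s_a ** 2 / n_a + s_b ** 2 / n_b)
t = ((xbar_a - xbar_b) - 0) / se   # hypothesized difference is 0

# Right-tailed test: reject only if t exceeds the critical value.
print(round(t, 2), t > critical_value)
```

The test statistic is well below the critical value, which is why the null hypothesis cannot be rejected.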


Example: 
Exercise: 


Problem: 


An interesting research question is the effect, if any, that different 
types of teaching formats have on the grade outcomes of students. To 
investigate this issue one sample of students’ grades was taken from a 
hybrid class and another sample taken from a standard lecture format 
class. Both classes were for the same subject. The mean course grade 
in percent for the 35 hybrid students is 74 with a standard deviation of
16. The mean grade of the 40 students from the standard lecture class
was 76 percent with a standard deviation of 9. Test at 5% to see if
there is any significant difference in the population mean grades 
between standard lecture course and hybrid class. 


Solution: 


We begin by noting that we have two groups, students from a hybrid 
class and students from a standard lecture format class. We also note 
that the random variable, what we are interested in, is students’ grades, 
a continuous random variable. We could have asked the research 
question in a different way and had a binary random variable. For 
example, we could have studied the percentage of students with a 
failing grade, or with an A grade. Both of these would be binary and 
thus a test of proportions and not a test of means as is the case here. 
Finally, there is no presumption as to which format might lead to 
higher grades so the hypothesis is stated as a two-tailed test. 


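$H_0: \mu_1 = \mu_2$
$H_a: \mu_1 \neq \mu_2$


As would virtually always be the case, we do not know the population
variances of the two distributions and thus our test statistic is:
Equation:

$$t_c = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_0}{\sqrt{\frac{(s_1)^2}{n_1} + \frac{(s_2)^2}{n_2}}} = \frac{(74 - 76) - 0}{\sqrt{\frac{16^2}{35} + \frac{9^2}{40}}} = -0.65$$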


To determine the critical value of the Student's t we need the degrees
of freedom. For this case we use: df = $n_1 + n_2 - 2$ = 35 + 40 - 2 = 73.
This is large enough to consider it the normal distribution, thus $t_{\alpha/2}$ =
1.96. Again, as always, we determine if the calculated value is in the
tail determined by the critical value. In this case we do not even need
to look up the critical value: the calculated difference between
these two average grades is not even one standard error from zero.
Certainly not in the tail.
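The calculation for this example can be sketched as follows, using only the summary statistics given in the problem and the normal critical value 1.96 mentioned above. The function name `two_sample_t` is an assumption made for illustration.

```python
import math


def two_sample_t(xbar1, s1, n1, xbar2, s2, n2, delta0=0.0):
    """Test statistic for two independent means, variances not pooled."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return ((xbar1 - xbar2) - delta0) / se


# Group 1: hybrid class; group 2: standard lecture class.
t = two_sample_t(xbar1=74, s1=16, n1=35, xbar2=76, s2=9, n2=40)
df = 35 + 40 - 2          # large-sample shortcut used in the text
critical_value = 1.96     # two-tailed, alpha = 0.05, normal approximation

reject = abs(t) > critical_value
print(round(t, 2), df, reject)
```

Because the absolute value of the statistic is smaller than 1, it cannot exceed 1.96, which matches the remark that no table lookup is needed.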


Conclusion: Cannot reject the null at $\alpha$ = 5%. Therefore, evidence
does not exist to prove that the grades in hybrid and standard
classes differ.




Chapter Review 


Two population means from independent samples where the population 
standard deviations are not known 


• Random Variable: $\bar{X}_1 - \bar{X}_2$ = the difference of the sampling means
• Distribution: Student's t-distribution with degrees of freedom
(variances not pooled)


Formula Review 


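Standard error: $SE = \sqrt{\frac{(s_1)^2}{n_1} + \frac{(s_2)^2}{n_2}}$


Test statistic (t-score): $t_c = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_0}{\sqrt{\frac{(s_1)^2}{n_1} + \frac{(s_2)^2}{n_2}}}$


Degrees of freedom:

$$df = \frac{\left(\frac{(s_1)^2}{n_1} + \frac{(s_2)^2}{n_2}\right)^2}{\left(\frac{1}{n_1 - 1}\right)\left(\frac{(s_1)^2}{n_1}\right)^2 + \left(\frac{1}{n_2 - 1}\right)\left(\frac{(s_2)^2}{n_2}\right)^2}$$


where:


$s_1$ and $s_2$ are the sample standard deviations, and $n_1$ and $n_2$ are the sample
sizes.


$\bar{x}_1$ and $\bar{x}_2$ are the sample means.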


Use the following information to answer the next 15 exercises: Indicate if 
the hypothesis test is for 


a. independent group means, population standard deviations, and/or
variances known
b. independent group means, population standard deviations, and/or
variances unknown
c. matched or paired samples
d. single mean
e. two proportions
f. single proportion
Exercise: 
Problem: 
It is believed that 70% of males pass their driver's test in the first
attempt, while 65% of females pass the test in the first attempt. Of
interest is whether the proportions are in fact equal.


Solution: 


two proportions 
Exercise: 
Problem: 
A new laundry detergent is tested on consumers. Of interest is the 


proportion of consumers who prefer the new brand over the leading 
competitor. A study is done to test this. 


Exercise: 
Problem: 
A new windshield treatment claims to repel water more effectively. 
Ten windshields are tested by simulating rain without the new 


treatment. The same windshields are then treated, and the experiment 
is run again. A hypothesis test is conducted. 


Solution: 


matched or paired samples 


Exercise: 
Problem: 
The known standard deviation in salary for all mid-level professionals 
in the financial industry is $11,000. Company A and Company B are in 
the financial industry. Suppose samples are taken of mid-level 
professionals from Company A and from Company B. The sample 
mean salary for mid-level professionals in Company A is $80,000. The 
sample mean salary for mid-level professionals in Company B is 


$96,000. Company A and Company B management want to know if 
their mid-level professionals are paid differently, on average. 


Exercise: 


Problem: 
The average worker in Germany gets eight weeks of paid vacation. 
Solution: 


single mean 

Exercise: 
Problem: 
According to a television commercial, 80% of dentists agree that 
Ultrafresh toothpaste is the best on the market. 

Exercise: 
Problem: 
It is believed that the average grade on an English essay in a particular 
school system for females is higher than for males. A random sample 
of 31 females had a mean score of 82 with a standard deviation of 


three, and a random sample of 25 males had a mean score of 76 with a 
standard deviation of four. 


Solution: 


independent group means, population standard deviations and/or 
variances unknown 


Exercise: 
Problem: 
The league mean batting average is 0.280 with a known standard 
deviation of 0.06. The Rattlers and the Vikings belong to the league. 
The mean batting average for a sample of eight Rattlers is 0.210, and 
the mean batting average for a sample of eight Vikings is 0.260. There 


are 24 players on the Rattlers and 19 players on the Vikings. Are the 
batting averages of the Rattlers and Vikings statistically different? 


Exercise: 
Problem: 
In a random sample of 100 forests in the United States, 56 were 
coniferous or contained conifers. In a random sample of 80 forests in 
Mexico, 40 were coniferous or contained conifers. Is the proportion of 


conifers in the United States statistically more than the proportion of 
conifers in Mexico? 


Solution: 


two proportions 
Exercise: 
Problem: 
A new medicine is said to help improve sleep. Eight subjects are 


picked at random and given the medicine. The mean hours slept for
each person were recorded before starting the medication and after.


Exercise: 


Problem: 


It is thought that teenagers sleep more than adults on average. A study 
is done to verify this. A sample of 16 teenagers has a mean of 8.9 
hours slept and a standard deviation of 1.2. A sample of 12 adults has a 
mean of 6.9 hours slept and a standard deviation of 0.6. 


Solution: 
independent group means, population standard deviations and/or 


variances unknown 


Exercise: 


Problem: Varsity athletes practice five times a week, on average. 
Exercise: 


Problem: 


A sample of 12 in-state graduate school programs at school A has a 
mean tuition of $64,000 with a standard deviation of $8,000. At school 
B, a sample of 16 in-state graduate programs has a mean of $80,000 
with a standard deviation of $6,000. On average, are the mean tuitions 
different? 


Solution: 
independent group means, population standard deviations and/or 
variances unknown 

Exercise: 
Problem: 
A new WiFi range booster is being offered to consumers. A researcher 
tests the native range of 12 different routers under the same conditions. 
The ranges are recorded. Then the researcher uses the new WiFi range 


booster and records the new ranges. Does the new WiFi range booster 
do a better job? 


Exercise: 


Problem: 


A high school principal claims that 30% of student athletes drive 
themselves to school, while 4% of non-athletes drive themselves to 
school. In a sample of 20 student athletes, 45% drive themselves to 
school. In a sample of 35 non-athlete students, 6% drive themselves to 
school. Is the percent of student athletes who drive themselves to 
school more than the percent of nonathletes? 


Solution: 


two proportions 


Use the following information to answer the next three exercises: A study is 
done to determine which of two soft drinks has more sugar. There are 13 
cans of Beverage A in a sample and six cans of Beverage B. The mean 
amount of sugar in Beverage A is 36 grams with a standard deviation of 0.6 
grams. The mean amount of sugar in Beverage B is 38 grams with a 
standard deviation of 0.8 grams. The researchers believe that Beverage B 
has more sugar than Beverage A, on average. Both populations have normal 
distributions. 

Exercise: 


Problem: Are standard deviations known or unknown? 
Exercise: 

Problem: What is the random variable? 

Solution: 


The random variable is the difference between the mean amounts of 
sugar in the two soft drinks. 


Exercise: 


Problem: Is this a one-tailed or two-tailed test? 


Use the following information to answer the next 12 exercises: The U.S. 
Center for Disease Control reports that the mean life expectancy was 47.6 
years for whites born in 1900 and 33.0 years for nonwhites. Suppose that 
you randomly survey death records for people born in 1900 in a certain 
county. Of the 124 whites, the mean life span was 45.3 years with a 
standard deviation of 12.7 years. Of the 82 nonwhites, the mean life span 
was 34.1 years with a standard deviation of 15.6 years. Conduct a 
hypothesis test to see if the mean life spans in the county were the same for 
whites and nonwhites. 

Exercise: 


Problem: Is this a test of means or proportions? 
Solution: 


means 


Exercise: 


Problem: State the null and alternative hypotheses. 


a. $H_0$: __________
b. $H_a$: __________


Exercise: 


Problem: Is this a right-tailed, left-tailed, or two-tailed test? 


Solution: 


two-tailed 


Exercise: 


Problem: 


In symbols, what is the random variable of interest for this test? 


Exercise: 


Problem: In words, define the random variable of interest for this test. 
Solution: 


the difference between the mean life spans of whites and nonwhites 
Exercise: 

Problem: 

Which distribution (normal or Student's t) would you use for this 

hypothesis test? 


Exercise: 


Problem: Explain why you chose the distribution you did for [link]. 
Solution: 
This is a comparison of two population means with unknown 


population standard deviations. 


Exercise: 


Problem: Calculate the test statistic. 
Exercise: 
Problem: 
Sketch a graph of the situation. Label the horizontal axis. Mark the 


hypothesized difference and the sample difference. Shade the area 
corresponding to the p-value. 


Solution: 
Check student’s solution. 
Exercise: 
Problem: At a pre-conceived $\alpha$ = 0.05, what is your:


a. Decision: 
b. Reason for the decision: 
c. Conclusion (write out in a complete sentence): 


Solution: 


a. Cannot accept the null hypothesis 

b. p-value < 0.05 

c. There is sufficient evidence at the 5% level of significance to
support the claim that life expectancy in the 1900s is different
between whites and nonwhites.


Exercise: 


Problem: 


Does it appear that the means are the same? Why or why not? 


Homework 


Exercise: 


Problem: 


The mean number of English courses taken in a two-year time period
by male and female college students is believed to be about the same. 
An experiment is conducted and data are collected from 29 males and 
16 females. The males took an average of three English courses with a 
standard deviation of 0.8. The females took an average of four English 
courses with a standard deviation of 1.0. Are the means statistically the 
same? 


Exercise: 


Problem: 


A student at a four-year college claims that mean enrollment at four-year
colleges is higher than at two-year colleges in the United States.
Two surveys are conducted. Of the 35 two-year colleges surveyed, the
mean enrollment was 5,068 with a standard deviation of 4,777. Of the 
35 four-year colleges surveyed, the mean enrollment was 5,466 with a 
standard deviation of 8,191. 


Solution: 
Subscripts: 1: two-year colleges; 2: four-year colleges 


a. $H_0: \mu_1 = \mu_2$

b. $H_a: \mu_1 < \mu_2$

c. $\bar{X}_1 - \bar{X}_2$ is the difference between the mean enrollments of the
two-year colleges and the four-year colleges. 

d. Student’s-t 

e. test statistic: -0.2480 

f. p-value: 0.4019 

g. Check student’s solution. 


h. i. Alpha: 0.05 
ii. Decision: Cannot reject 
iii. Reason for Decision: p-value > alpha 


iv. Conclusion: At the 5% significance level, there is insufficient
evidence to conclude that the mean enrollment at four-year
colleges is higher than at two-year colleges.


Exercise: 


Problem: 


At Rachel’s 11th birthday party, eight girls were timed to see how long
(in seconds) they could hold their breath in a relaxed position. After a 
two-minute rest, they timed themselves while jumping. The girls 
thought that the mean difference between their jumping and relaxed 
times would be zero. Test their hypothesis. 


Relaxed time (seconds)   Jumping time (seconds)
26                       21
47                       40
30                       28
22                       21
23                       25
45                       43
37                       35
29                       32


Exercise: 


Problem: 


Mean entry-level salaries for college graduates with mechanical 
engineering degrees and electrical engineering degrees are believed to 
be approximately the same. A recruiting office thinks that the mean 
mechanical engineering salary is actually lower than the mean 
electrical engineering salary. The recruiting office randomly surveys 
50 entry level mechanical engineers and 60 entry level electrical 
engineers. Their mean salaries were $46,100 and $46,700, 
respectively. Their standard deviations were $3,450 and $4,210, 
respectively. Conduct a hypothesis test to determine if you agree that 
the mean entry-level mechanical engineering salary is lower than the 
mean entry-level electrical engineering salary. 


Solution: 


Subscripts: 1: mechanical engineering; 2: electrical engineering 


a. $H_0: \mu_1 = \mu_2$

b. $H_a: \mu_1 < \mu_2$

c. $\bar{X}_1 - \bar{X}_2$ is the difference between the mean entry-level salaries
of mechanical engineers and electrical engineers.

d. $t_{108}$

e. test statistic: t = -0.82

f. p-value: 0.2061 

g. Check student’s solution. 


h. i. Alpha: 0.05 
ii. Decision: Cannot reject the null hypothesis. 
iii. Reason for Decision: p-value > alpha 
iv. Conclusion: At the 5% significance level, there is
insufficient evidence to conclude that the mean entry-level
salary of mechanical engineers is lower than that of
electrical engineers.
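
The figures in this solution can be checked with a short calculation. The following is a minimal Python sketch, assuming the unpooled (Welch) form of the two-sample t statistic and its approximate degrees of freedom. The function name `welch_t_and_df` is illustrative and does not come from the text.

```python
import math

def welch_t_and_df(mean1, sd1, n1, mean2, sd2, n2):
    """Return the unpooled two-sample t statistic and its approximate degrees of freedom."""
    v1 = sd1 ** 2 / n1  # estimated variance of the first sample mean
    v2 = sd2 ** 2 / n2  # estimated variance of the second sample mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Subscript 1: mechanical engineers; subscript 2: electrical engineers
t_stat, df = welch_t_and_df(46100, 3450, 50, 46700, 4210, 60)
print(round(t_stat, 2), round(df))
```

The result rounds to t = -0.82 with about 108 degrees of freedom, in line with parts d and e. Finding the p-value additionally requires the Student's t cumulative distribution, from a calculator or statistical software.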


Exercise: 


Problem: 


Marketing companies have collected data implying that teenage girls 
use more ring tones on their cellular phones than teenage boys do. In 
one particular study of 40 randomly chosen teenage girls and boys (20 
of each) with cellular phones, the mean number of ring tones for the 
girls was 3.2 with a standard deviation of 1.5. The mean for the boys 
was 1.7 with a standard deviation of 0.8. Conduct a hypothesis test to 
determine if the means are approximately the same or if the girls’ 
mean is higher than the boys’ mean. 


Use the information from Appendix C: Data Sets to answer the next four 
exercises. 
Exercise: 


Problem: 
Using the data from Lap 1 only, conduct a hypothesis test to determine
if the mean time for completing a lap in races is the same as it is in
practices.


Solution: 
a. $H_0: \mu_1 = \mu_2$
b. $H_a: \mu_1 \neq \mu_2$


c. $\bar{X}_1 - \bar{X}_2$ is the difference between the mean times for
completing a lap in races and in practices.

d. $t_{20.32}$

e. test statistic: -4.70

f. p-value: 0.0001 

g. Check student’s solution. 


h. i. Alpha: 0.05 
ii. Decision: Cannot accept the null hypothesis. 
iii. Reason for Decision: p-value < alpha 
iv. Conclusion: At the 5% significance level, there is sufficient
evidence to conclude that the mean time for completing a lap
in races is different from that in practices.
Exercise: 


Problem: Repeat the test in [link], but use Lap 5 data this time. 
Exercise: 
Problem: 


Repeat the test in [link], but this time combine the data from Laps 1 
and 5. 


Solution: 
a. $H_0: \mu_1 = \mu_2$
b. $H_a: \mu_1 \neq \mu_2$


c. $\bar{X}_1 - \bar{X}_2$ is the difference between the mean times for completing a lap in
races and in practices.

d. $t_{40.94}$

e. test statistic: -5.08

f. p-value: approximately zero

g. Check student’s solution. 


h. i. Alpha: 0.05 
ii. Decision: Cannot accept the null hypothesis. 
iii. Reason for Decision: p-value < alpha 
iv. Conclusion: At the 5% significance level, there is sufficient 
evidence to conclude that the mean time for completing a lap 
in races is different from that in practices. 


Exercise: 
Problem: 
In two to three complete sentences, explain in detail how you might
use Terri Vogel’s data to answer the following question. “Does Terri
Vogel drive faster in races than she does in practices?”


Use the following information to answer the next two exercises. The Eastern 
and Western Major League Soccer conferences have a new Reserve 
Division that allows new players to develop their skills. Data for a 
randomly picked date showed the following annual goals. 


Western           Goals    Eastern         Goals
Los Angeles       9        D.C. United     9
FC Dallas         3        Chicago         8
Chivas USA        4        Columbus        7
Real Salt Lake    3        New England     6
Colorado          4        MetroStars      5
San Jose          4        Kansas City     3


Conduct a hypothesis test to answer the next two exercises. 
Exercise: 


Problem: The exact distribution for the hypothesis test is: 


a. the normal distribution 

b. the Student's t-distribution 
c. the uniform distribution 

d. the exponential distribution 


Exercise: 


Problem: If the level of significance is 0.05, the conclusion is: 


a. There is sufficient evidence to conclude that the W Division
teams score fewer goals, on average, than the E teams.

b. There is insufficient evidence to conclude that the W Division
teams score more goals, on average, than the E teams.

c. There is insufficient evidence to conclude that the W teams score
fewer goals, on average, than the E teams score.

d. Unable to determine 


Solution: 


c
Exercise: 


Problem: 


Suppose a Statistics instructor believes that there is no significant 
difference between the mean class scores of statistics day students on 
Exam 2 and statistics night students on Exam 2. She takes random 
samples from each of the populations. The mean and standard 
deviation for 35 statistics day students were 75.86 and 16.91. The 
mean and standard deviation for 37 statistics night students were 75.41 
and 19.73. The “day” subscript refers to the statistics day students. The 
“night” subscript refers to the statistics night students. A concluding 
statement is: 


a. There is sufficient evidence to conclude that statistics night 
students' mean on Exam 2 is better than the statistics day students’ 
mean on Exam 2. 

b. There is insufficient evidence to conclude that the statistics day 
students' mean on Exam 2 is better than the statistics night 
students' mean on Exam 2. 

c. There is insufficient evidence to conclude that there is a 
significant difference between the means of the statistics day 
students and night students on Exam 2. 


d. There is sufficient evidence to conclude that there is a significant 
difference between the means of the statistics day students and 
night students on Exam 2. 


Exercise: 


Problem: 


Researchers interviewed street prostitutes in Canada and the United 
States. The mean age of the 100 Canadian prostitutes upon entering 
prostitution was 18 with a standard deviation of six. The mean age of 
the 130 United States prostitutes upon entering prostitution was 20 
with a standard deviation of eight. Is the mean age of entering 
prostitution in Canada lower than the mean age in the United States? 
Test at a 1% significance level. 


Solution: 


Test: two independent sample means, population standard deviations 
unknown. 


Random variable: $\bar{X}_1 - \bar{X}_2$
Distribution: $H_0: \mu_1 = \mu_2$; $H_a: \mu_1 < \mu_2$. The alternative states that the
mean age of entering prostitution in Canada is lower than the mean age
in the United States.
Graph: left-tailed 
p-value : 0.0151 


Decision: Cannot reject Ho. 


Conclusion: At the 1% level of significance, from the sample data,
there is not sufficient evidence to conclude that the mean age of
entering prostitution in Canada is lower than the mean age in the
United States.


Exercise: 


Problem: 


A powder diet is tested on 49 people, and a liquid diet is tested on 36 
different people. Of interest is whether the liquid diet yields a higher 
mean weight loss than the powder diet. The powder diet group had a 
mean weight loss of 42 pounds with a standard deviation of 12 pounds. 
The liquid diet group had a mean weight loss of 45 pounds with a 
standard deviation of 14 pounds. 


Exercise: 


Problem: 


Suppose a Statistics instructor believes that there is no significant 
difference between the mean class scores of statistics day students on 
Exam 2 and statistics night students on Exam 2. She takes random 
samples from each of the populations. The mean and standard 
deviation for 35 statistics day students were 75.86 and 16.91, 
respectively. The mean and standard deviation for 37 statistics night 
students were 75.41 and 19.73. The “day” subscript refers to the 
Statistics day students. The “night” subscript refers to the statistics 
night students. An appropriate alternative hypothesis for the hypothesis 
test is: 


a. $\mu_{day} > \mu_{night}$
b. $\mu_{day} < \mu_{night}$
c. $\mu_{day} = \mu_{night}$
d. $\mu_{day} \neq \mu_{night}$


Solution: 


d 


Glossary 


Cohen’s d 
a measure of effect size based on the differences between two means.
If d is between 0 and 0.2, then the effect is small. If d approaches 0.5,
then the effect is medium, and if d approaches 0.8, then it is a large
effect.


Pooled Variance 
a weighted average of two variances that can then be used when 
calculating standard error. 


Comparing Two Independent Population Proportions 


When conducting a hypothesis test that compares two independent 
population proportions, the following characteristics should be present: 


1. The two samples are random samples that are independent of
each other.

2. The number of successes is at least five, and the number of failures is 
at least five, for each of the samples. 

3. Growing literature states that the population must be at least ten or 
even perhaps 20 times the size of the sample. This keeps each 
population from being over-sampled and causing biased results. 


Comparing two proportions, like comparing two means, is common. If two 
estimated proportions are different, it may be due to a difference in the 
populations or it may be due to chance in the sampling. A hypothesis test 
can help determine if a difference in the estimated proportions reflects a 
difference in the two population proportions. 


As with differences in sample means, we construct a sampling
distribution for differences in sample proportions, $(p'_A - p'_B)$, where
$p'_A = \frac{X_A}{n_A}$ and $p'_B = \frac{X_B}{n_B}$ are the sample proportions for the two sets of
data in question. $X_A$ and $X_B$ are the number of successes in each sample
group, and $n_A$ and $n_B$ are the respective sample sizes from the
two groups. Again we turn to the Central Limit Theorem to find the distribution
of the differences in sample proportions, and again we find that this
sampling distribution, like the earlier ones, is approximately normal,
as seen in [link].


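[Figure: Population 1 and Population 2 lead to a sampling distribution of the difference in sample proportions, drawn on a z-axis, with $H_0: p_1 - p_2 = \delta_0$ and $H_a: p_1 - p_2 \neq \delta_0$.]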


Generally, the null hypothesis allows for the test of a difference of a
particular value, $\delta_0$, just as we did for the case of differences in means.
Equation:


$H_0: p_1 - p_2 = \delta_0$
Equation:
$H_a: p_1 - p_2 \neq \delta_0$


Most common, however, is the test that the two proportions are the same. 
That is, 


Equation:
$H_0: p_A = p_B$
Equation:
$H_a: p_A \neq p_B$


To conduct the test, we use a pooled proportion, $p_c$.
Equation:
The pooled proportion is calculated as follows:
$p_c = \frac{X_A + X_B}{n_A + n_B}$


Equation:
The test statistic (z-score) is:
$Z_c = \frac{(p'_A - p'_B) - \delta_0}{\sqrt{p_c(1 - p_c)\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}}$


where $\delta_0$ is the hypothesized difference between the two proportions and
$p_c$ is the pooled proportion from the formula above.


Example: 
Exercise: 


Problem: 


A bank has recently acquired a new branch and thus has customers in 
this new territory. They are interested in the default rate in their new 
territory. They wish to test the hypothesis that the default rate is 
different from their current customer base. They sample 200 files in 
area A, their current customers, and find that 20 have defaulted. In 
area B, the new customers, another sample of 200 files shows 12 have 
defaulted on their loans. At a 10% level of significance can we say 
that the default rates are the same or different? 


Solution: 


This is a test of proportions. We know this because the underlying 
random variable is binary, default or not default. Further, we know it 
is a test of differences in proportions because we have two sample 
groups, the current customer base and the newly acquired customer 
base. Let A and B be the subscripts for the two customer groups. Then 
$p_A$ and $p_B$ are the two population proportions we wish to test.


Random Variable:
$P'_A - P'_B$ = difference in the proportions of customers who defaulted
in the two groups.


$H_0: p_A = p_B$
$H_a: p_A \neq p_B$
The word "different" tells you the test is two-tailed.


Distribution for the test: Since this is a test of two binomial
population proportions, the distribution is normal:


$p_c = \frac{X_A + X_B}{n_A + n_B} = \frac{20 + 12}{200 + 200} = 0.08$, so $1 - p_c = 0.92$


$(P'_A - P'_B)$ follows an approximate normal distribution; its observed value is 0.04.


Estimated proportion for group A: $p'_A = \frac{X_A}{n_A} = \frac{20}{200} = 0.1$


Estimated proportion for group B: $p'_B = \frac{X_B}{n_B} = \frac{12}{200} = 0.06$


The estimated difference between the two groups is: $p'_A - p'_B = 0.1 - 0.06 = 0.04$.


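[Figure: standard normal curve for the two-tailed test with $\alpha/2 = 0.05$ in each tail, critical values -1.645 and 1.645, and the calculated test statistic marked between them; $H_0: p_A = p_B$, $H_a: p_A \neq p_B$.]
Equation:
$Z_c = \frac{(p'_A - p'_B) - \delta_0}{\sqrt{p_c(1 - p_c)\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}} = \frac{0.04 - 0}{\sqrt{(0.08)(0.92)\left(\frac{1}{200} + \frac{1}{200}\right)}} \approx 1.47$


The calculated test statistic is about 1.47, which lies between the critical
values -1.645 and 1.645 and so is not in a tail of the distribution.


Make a decision: Since the calculated test statistic is not in the tail of
the distribution, we cannot reject $H_0$.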


Conclusion: At a 10% level of significance, from the sample data,
there is not sufficient evidence to conclude that there is a difference
between the proportions of customers who defaulted in the two
groups.
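
The arithmetic in this example can be reproduced with a few lines of code. The sketch below is a minimal Python version of the pooled two-proportion z statistic, compared against the two-tailed critical value 1.645 for a 10% significance level. The function name `two_proportion_z` and its `delta0` parameter are illustrative names, not part of the text.

```python
import math

def two_proportion_z(x_a, n_a, x_b, n_b, delta0=0.0):
    """Pooled z statistic for a test of two independent proportions."""
    p_a = x_a / n_a                      # estimated proportion, group A
    p_b = x_b / n_b                      # estimated proportion, group B
    p_c = (x_a + x_b) / (n_a + n_b)      # pooled proportion
    se = math.sqrt(p_c * (1 - p_c) * (1 / n_a + 1 / n_b))
    return ((p_a - p_b) - delta0) / se

# Bank example: 20 defaults out of 200 files in area A, 12 out of 200 in area B
z = two_proportion_z(20, 200, 12, 200)
critical = 1.645                         # two-tailed critical value for alpha = 0.10
reject_h0 = abs(z) > critical
print(round(z, 2), reject_h0)
```

With these counts the statistic is roughly 1.47, inside the interval from -1.645 to 1.645, so the sketch reports that the null hypothesis is not rejected.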


Note: 
Try It 
Exercise: 


Problem: 


Two types of valves are being tested to determine if there is a 
difference in pressure tolerances. Fifteen out of a random sample of 
100 of Valve A cracked under 4,500 psi. Six out of a random sample 
of 100 of Valve B cracked under 4,500 psi. Test at a 5% level of 
significance. 


Solution: 


The p-value is 0.0379, so we can reject the null hypothesis. At the 5% 
significance level, the data support that there is a difference in the 
pressure tolerances between the two valves. 
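
The p-value quoted in this solution can be checked numerically. The following Python sketch assumes the pooled two-proportion z statistic and gets the two-tailed p-value from the standard normal distribution using `math.erfc`. The helper name `two_tailed_p_value` is illustrative.

```python
import math

def two_tailed_p_value(x_a, n_a, x_b, n_b):
    """Return (z, p) for H0: p_A = p_B against Ha: p_A != p_B using the pooled proportion."""
    p_c = (x_a + x_b) / (n_a + n_b)                          # pooled proportion
    se = math.sqrt(p_c * (1 - p_c) * (1 / n_a + 1 / n_b))    # standard error under H0
    z = (x_a / n_a - x_b / n_b) / se
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Valve A: 15 of 100 cracked; Valve B: 6 of 100 cracked
z_valves, p_valves = two_tailed_p_value(15, 100, 6, 100)
print(round(z_valves, 2), round(p_valves, 4))
```

The p-value comes out near 0.038, which is below 0.05 and so agrees with the decision to reject the null hypothesis.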
Chapter Review
Test of two population proportions from independent samples. 


• Random variable: $p'_A - p'_B$ = difference between the two estimated
proportions
• Distribution: normal distribution


Formula Review 


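Pooled Proportion: $p_c = \frac{X_A + X_B}{n_A + n_B}$


Test Statistic (z-score): $Z_c = \frac{(p'_A - p'_B)}{\sqrt{p_c(1 - p_c)\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}}$


where


$p'_A$ and $p'_B$ are the sample proportions, $p_A$ and $p_B$ are the population
proportions, $X_A$ and $X_B$ are the numbers of successes in the two samples,


$p_c$ is the pooled proportion, and $n_A$ and $n_B$ are the sample sizes.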


Use the following information for the next five exercises. Two types of
phone operating system are being tested to determine if there is a difference
in the proportions of system failures (crashes). Fifteen out of a random
sample of 150 phones with OS1 had system failures within the first eight
hours of operation. Nine out of another random sample of 150 phones with
OS2 had system failures within the first eight hours of operation. OS2 is
believed to be more stable (have fewer crashes) than OS1.

Exercise: 


Problem: Is this a test of means or proportions? 


Exercise: 


Problem: What is the random variable? 


Solution: 


$P'_{OS1} - P'_{OS2}$ = difference in the proportions of phones that had system
failures within the first eight hours of operation with OS1 and OS2.


Exercise: 


Problem: State the null and alternative hypotheses. 


Exercise: 


Problem: What can you conclude about the two operating systems? 


Use the following information to answer the next twelve exercises. In the 
recent Census, three percent of the U.S. population reported being of two or 
more races. However, the percent varies tremendously from state to state. 
Suppose that two random surveys are conducted. In the first random survey, 
out of 1,000 North Dakotans, only nine people reported being of two or 
more races. In the second random survey, out of 500 Nevadans, 17 people 
reported being of two or more races. Conduct a hypothesis test to determine 
if the population percents are the same for the two states or if the percent 
for Nevada is statistically higher than for North Dakota. 

Exercise: 


Problem: Is this a test of means or proportions? 
Solution: 
proportions 
Exercise: 
Problem: State the null and alternative hypotheses. 


a. $H_0$:
b. $H_a$:


Exercise: 
Problem: 
Is this a right-tailed, left-tailed, or two-tailed test? How do you know? 
Solution: 
right-tailed 


Exercise: 


Problem: What is the random variable of interest for this test? 


Exercise: 


Problem: In words, define the random variable for this test. 


Solution: 


The random variable is the difference in proportions (percents) of the 
populations that are of two or more races in Nevada and North Dakota. 


Exercise: 


Problem: 
Which distribution (normal or Student's t) would you use for this 
hypothesis test? 

Exercise: 


Problem: 


Explain why you chose the distribution you did for Exercise 10.56.


Solution: 
The numbers of successes and failures in each sample are at least five, so we use the
normal distribution for two proportions for this hypothesis test.


Exercise: 


Problem: Calculate the test statistic. 


Exercise: 


Problem: At a pre-conceived $\alpha$ = 0.05, what is your:


a. Decision: 
b. Reason for the decision: 
c. Conclusion (write out in a complete sentence): 


Solution: 


a. Cannot accept the null hypothesis. 

b. p-value < alpha 

c. At the 5% significance level, there is sufficient evidence to 
conclude that the proportion (percent) of the population that is of 
two or more races in Nevada is statistically higher than that in 
North Dakota. 


Exercise: 


Problem: 


Does it appear that the proportion of Nevadans who are two or more 
races is higher than the proportion of North Dakotans? Why or why 
not? 


Homework 


Exercise: 


Problem: 


A recent drug survey showed an increase in the use of drugs and 
alcohol among local high school seniors as compared to the national 
percent. Suppose that a survey of 100 local seniors and 100 national 
seniors is conducted to see if the proportion of drug and alcohol use is 
higher locally than nationally. Locally, 65 seniors reported using drugs 
or alcohol within the past month, while 60 national seniors reported 
using them. 


Exercise: 


Problem: 


We are interested in whether the proportions of female suicide victims
for ages 15 to 24 are the same for the white and black races in the
United States. We randomly pick one year, 1992, to compare the races.
The number of suicides estimated in the United States in 1992 for 
white females is 4,930. Five hundred eighty were aged 15 to 24. The 
estimate for black females is 330. Forty were aged 15 to 24. We will 
let female suicide victims be our population. 


Solution: 
a. $H_0: p_W = p_B$
b. $H_a: p_W \neq p_B$


c. The random variable is the difference in the proportions of white 
and black suicide victims, aged 15 to 24. 


d. normal for two proportions 
e. test statistic: -0.1944 

f. p-value: 0.8458 

g. Check student’s solution. 


h. i. Alpha: 0.05 
ii. Decision: Cannot reject the null hypothesis.
iii. Reason for decision: p-value > alpha 
iv. Conclusion: At the 5% significance level, there is 
insufficient evidence to conclude that the proportions of 
white and black female suicide victims, aged 15 to 24, are 
different. 


Exercise: 


Problem: 


Elizabeth Mjelde, an art history professor, was interested in whether
the value from the Golden Ratio formula, $\left(\frac{\text{larger + smaller dimension}}{\text{larger dimension}}\right)$,
was the same in the Whitney Exhibit for works from 1900 to 1919 as
for works from 1920 to 1942. Thirty-seven early works were sampled,
averaging 1.74 with a standard deviation of 0.11. Sixty-five of the later
works were sampled, averaging 1.746 with a standard deviation of
0.1064. Do you think that there is a significant difference in the
Golden Ratio calculation?


Exercise: 


Problem: 


A recent year was randomly picked from 1985 to the present. In that 
year, there were 2,051 Hispanic students at Cabrillo College out of a 
total of 12,328 students. At Lake Tahoe College, there were 321 
Hispanic students out of a total of 2,441 students. In general, do you 
think that the percent of Hispanic students at the two colleges is 
basically the same or different? 


Solution: 


Subscripts: 1 = Cabrillo College, 2 = Lake Tahoe College 


a. $H_0: p_1 = p_2$

b. $H_a: p_1 \neq p_2$

c. The random variable is the difference between the proportions of 
Hispanic students at Cabrillo College and Lake Tahoe College. 

d. normal for two proportions 

e. test statistic: 4.29 

f. p-value: 0.00002 

g. Check student’s solution. 


h. i. Alpha: 0.05 
ii. Decision: Cannot accept the null hypothesis. 
iii. Reason for decision: p-value < alpha 
iv. Conclusion: There is sufficient evidence to conclude that the 
proportions of Hispanic students at Cabrillo College and 
Lake Tahoe College are different. 


Use the following information to answer the next three exercises. 
Neuroinvasive West Nile virus is a severe disease that affects a person’s
nervous system. It is spread by the Culex species of mosquito. In the
United States in 2010 there were 629 reported cases of neuroinvasive West 
Nile virus out of a total of 1,021 reported cases and there were 486 
neuroinvasive reported cases out of a total of 712 cases reported in 2011. Is 
the 2011 proportion of neuroinvasive West Nile virus cases more than the 
2010 proportion of neuroinvasive West Nile virus cases? Using a 1% level 
of significance, conduct an appropriate hypothesis test. 


• “2011” subscript: 2011 group
• “2010” subscript: 2010 group


Exercise: 


Problem: This is: 


a. a test of two proportions 


b. a test of two independent means 
c. a test of a single mean 
d. a test of matched pairs. 


Exercise: 


Problem: An appropriate null hypothesis is: 


a. $p_{2011} \leq p_{2010}$
b. $p_{2011} \geq p_{2010}$
c. $\mu_{2011} \leq \mu_{2010}$
d. $p_{2011} > p_{2010}$


Solution: 


a 

Exercise: 
Problem: 
Researchers conducted a study to find out if there is a difference in the
use of eReaders by different age groups. Randomly selected
participants were divided into two age groups. In the 16- to 29-year-old
group, 7% of the 628 surveyed use eReaders, while 11% of the
2,309 participants 30 years old and older use eReaders.


Solution: 
Test: two independent sample proportions. 
Random variable: $P'_1 - P'_2$


Distribution:
$H_0: p_1 = p_2$
$H_a: p_1 \neq p_2$


The proportion of eReader users is different for the 16- to 29-year-old 
users from that of the 30 and older users. 


Graph: two-tailed 
Exercise: 


Problem: 


Adults aged 18 years old and older were randomly selected for a 
survey on obesity. Adults are considered obese if their body mass 
index (BMI) is at least 30. The researchers wanted to determine if the 
proportion of women who are obese in the south is less than the 
proportion of southern men who are obese. The results are shown in 
[link]. Test at the 1% level of significance. 


         Number who are obese    Sample size
Men      42,769                  155,525
Women    67,169                  248,775
Exercise: 
Problem: 


Two computer users were discussing tablet computers. A higher 
proportion of people ages 16 to 29 use tablets than the proportion of 
people age 30 and older. [link] details the number of tablet owners for 
each age group. Test at the 1% level of significance. 


                16-29 year olds    30 years old and older
Own a tablet    69                 231
Sample size     628                2,309
Solution: 


Test: two independent sample proportions 
Random variable: $P'_1 - P'_2$
Distribution:


$H_0: p_1 = p_2$
$H_a: p_1 > p_2$


A higher proportion of tablet owners are aged 16 to 29 years old than 
are 30 years old and older. 


Graph: right-tailed 
Decision: Do not reject $H_0$.


Conclusion: At the 1% level of significance, from the sample data, 
there is not sufficient evidence to conclude that a higher proportion of 
tablet owners are aged 16 to 29 years old than are 30 years old and 
older. 


Exercise: 


Problem: 


A group of friends debated whether more men use smartphones than 
women. They consulted a research study of smartphone use among 
adults. The results of the survey indicate that of the 973 men randomly 
sampled, 379 use smartphones. For women, 404 of the 1,304 who were 
randomly sampled use smartphones. Test at the 5% level of 
significance. 


Exercise: 


Problem: 


While her husband spent 2½ hours picking out new speakers, a
statistician decided to determine whether the percent of men who 
enjoy shopping for electronic equipment is higher than the percent of 
women who enjoy shopping for electronic equipment. The population 
was Saturday afternoon shoppers. Out of 67 men, 24 said they enjoyed 
the activity. Eight of the 24 women surveyed claimed to enjoy the 
activity. Interpret the results of the survey. 


Solution: 
Subscripts: 1: men; 2: women 


a. $H_0: p_1 \leq p_2$

b. $H_a: p_1 > p_2$

c. $P'_1 - P'_2$ is the difference between the proportions of men and
women who enjoy shopping for electronic equipment. 

d. normal for two proportions 

e. test statistic: 0.22 

f. p-value: 0.4133 

g. Check student’s solution. 


h. i. Alpha: 0.05 
ii. Decision: Cannot reject the null hypothesis. 
iii. Reason for Decision: p-value > alpha 


iv. Conclusion: At the 5% significance level, there is 
insufficient evidence to conclude that the proportion of men 
who enjoy shopping for electronic equipment is more than 
the proportion of women. 


Exercise: 


Problem: 


We are interested in whether children’s educational computer software 
costs less, on average, than children’s entertainment software. Thirty- 
six educational software titles were randomly picked from a catalog. 
The mean cost was $31.14 with a standard deviation of $4.69. Thirty- 
five entertainment software titles were randomly picked from the same 
catalog. The mean cost was $33.86 with a standard deviation of 
$10.87. Decide whether children’s educational software costs less, on 
average, than children’s entertainment software. 


Exercise: 


Problem: 


Joan Nguyen recently claimed that the proportion of college-age males 
with at least one pierced ear is as high as the proportion of college-age 
females. She conducted a survey in her classes. Out of 107 males, 20 
had at least one pierced ear. Out of 92 females, 47 had at least one 
pierced ear. Do you believe that the proportion of males has reached 
the proportion of females? 


Solution: 
a. $H_0: p_1 = p_2$
b. $H_a: p_1 \neq p_2$


c. $P'_1 - P'_2$ is the difference between the proportions of men and
women that have at least one pierced ear. 

d. normal for two proportions 

e. test statistic: -4.82

f. p-value: zero 

g. Check student’s solution. 


h. i. Alpha: 0.05 
ii. Decision: Cannot accept the null hypothesis. 
iii. Reason for Decision: p-value < alpha 
iv. Conclusion: At the 5% significance level, there is sufficient
evidence to conclude that the proportions of males and
females with at least one pierced ear are different.


Exercise: 


Problem: "To Breakfast or Not to Breakfast?" by Richard Ayore 


In the American society, birthdays are one of those days that everyone 
looks forward to. People of different ages and peer groups gather to 
mark the 18th, 20th, ..., birthdays. During this time, one looks back to 
see what he or she has achieved for the past year and also focuses 
ahead for more to come. 


If, by any chance, I am invited to one of these parties, my experience is 
always different. Instead of dancing around with my friends while the 
music is booming, I get carried away by memories of my family back 
home in Kenya. I remember the good times I had with my brothers and 
sister while we did our daily routine. 


Every morning, I remember we went to the shamba (garden) to weed 
our crops. I remember one day arguing with my brother as to why he 
always remained behind just to join us an hour later. In his defense, he 
said that he preferred waiting for breakfast before he came to weed. He 
said, “This is why I always work more hours than you guys!” 


And so, to prove him wrong or right, we decided to give it a try. One 
day we went to work as usual without breakfast, and recorded the time 
we could work before getting tired and stopping. On the next day, we 
all ate breakfast before going to work. We recorded how long we 
worked again before getting tired and stopping. Of interest was our 
mean increase in work time. Though not sure, my brother insisted that 
it was more than two hours. Using the data in [link], solve our 
problem. 


Work hours with breakfast    Work hours without breakfast
8                            6
7                            5
9                            5
5                            4
9                            7
8                            7
10                           7
7                            5
6                            6
9                            5
Solution: 
a. $H_0: \mu_d = 0$
b. $H_a: \mu_d > 0$


c. The random variable $\bar{X}_d$ is the mean difference in work times on
days when eating breakfast and on days when not eating
breakfast.

d. $t_9$

e. test statistic: 4.8963 

f. p-value: 0.0004 

g. Check student’s solution. 


h. i. Alpha: 0.05
ii. Decision: Cannot accept the null hypothesis.
iii. Reason for Decision: p-value < alpha
iv. Conclusion: At the 5% level of significance, there is
sufficient evidence to conclude that the mean difference in
work times on days when eating breakfast and on days when
not eating breakfast has increased.


Two Population Means with Known Standard Deviations 


Even though this situation is not likely (knowing the population standard deviations
is very unlikely), the following example illustrates hypothesis testing for
independent means with known population standard deviations. The sampling
distribution for the difference between the means is normal in accordance with the
central limit theorem. The random variable is $\bar{X}_1 - \bar{X}_2$. The normal distribution has
the following format:


Equation:
The standard deviation is:
$\sqrt{\frac{(\sigma_1)^2}{n_1} + \frac{(\sigma_2)^2}{n_2}}$
Equation:
The test statistic (z-score) is:
$Z_c = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_0}{\sqrt{\frac{(\sigma_1)^2}{n_1} + \frac{(\sigma_2)^2}{n_2}}}$
where $\delta_0$ is the hypothesized difference between the two population means.
oe 
Example: 


Independent groups, population standard deviations known: The mean lasting
time of two competing floor waxes is to be compared. Twenty floors are randomly
assigned to test each wax. Both populations have normal distributions. The data
are recorded in [link].


Wax    Sample mean number of months floor wax lasts    Population standard deviation
1      3                                               0.33
2      2.9                                             0.36
Exercise: 

Problem: 


Does the data indicate that wax 1 is more effective than wax 2? Test at a 5% 
level of significance. 


Solution: 


This is a test of two independent groups, two population means, population 
standard deviations known. 


Random Variable: $\bar{X}_1 - \bar{X}_2$ = difference in the mean number of months the
competing floor waxes last.


$H_0: \mu_1 \leq \mu_2$
$H_a: \mu_1 > \mu_2$


The words "is more effective" say that wax 1 lasts longer than wax 2, on
average. "Longer" is a “>” symbol and goes into $H_a$. Therefore, this is a
right-tailed test.


Distribution for the test: The population standard deviations are known so 
the distribution is normal. Using the formula for the test statistic we find the 
calculated value for the problem. 

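Equation:
$Z_c = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_0}{\sqrt{\frac{(\sigma_1)^2}{n_1} + \frac{(\sigma_2)^2}{n_2}}} = \frac{(3 - 2.9) - 0}{\sqrt{\frac{(0.33)^2}{20} + \frac{(0.36)^2}{20}}} \approx 0.92$


[Figure: standard normal curve for the right-tailed test with $\alpha = 0.05$; $H_0: \mu_1 \leq \mu_2$, $H_a: \mu_1 > \mu_2$.]


The estimated difference between the two means is: $\bar{X}_1 - \bar{X}_2 = 3 - 2.9 = 0.1$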


Compare calculated value and critical value $Z_\alpha$: We mark the
calculated value on the graph and find that the calculated value is not in the tail;
therefore we cannot reject the null hypothesis.


Make a decision: the calculated value of the test statistic is not in the tail,
therefore you cannot reject $H_0$.


Conclusion: At the 5% level of significance, from the sample data, there is not 
sufficient evidence to conclude that the mean time wax 1 lasts is longer (wax 1 
is more effective) than the mean time wax 2 lasts. 
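
The floor wax calculation can be sketched in code. The Python example below assumes 20 floors per wax, as stated in the example, and a hypothesized difference of zero. It compares the z statistic with the right-tailed critical value 1.645 for a 5% significance level. The function name `two_mean_z` is illustrative.

```python
import math

def two_mean_z(mean1, sigma1, n1, mean2, sigma2, n2, delta0=0.0):
    """z statistic for two independent means with known population standard deviations."""
    se = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)  # standard deviation of the difference
    return ((mean1 - mean2) - delta0) / se

# Wax 1: mean 3 months, sigma 0.33; wax 2: mean 2.9 months, sigma 0.36; 20 floors each
z_wax = two_mean_z(3, 0.33, 20, 2.9, 0.36, 20)
critical_right = 1.645            # right-tailed critical value for alpha = 0.05
reject_wax = z_wax > critical_right
print(round(z_wax, 2), reject_wax)
```

The statistic is about 0.92, well short of 1.645, which matches the decision not to reject the null hypothesis.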


Note: 
Try It 
Exercise: 


Problem: 


The means of the number of revolutions per minute of two competing engines 
are to be compared. Thirty engines are randomly assigned to be tested. Both 
populations have normal distributions. [link] shows the result. Do the data 
indicate that Engine 2 has higher RPM than Engine 1? Test at a 5% level of 
significance. 


Engine    Sample mean number of RPM    Population standard deviation
1         1,500                        50
2         1,600                        60
Solution: 


The p-value is almost zero, so we reject the null hypothesis. There is sufficient 
evidence to conclude that Engine 2 runs at a higher RPM than Engine 1. 


Example: 

An interested citizen wanted to know if Democratic U.S. senators are older than
Republican U.S. senators, on average. On May 26, 2013, the mean age of 30
randomly selected Republican Senators was 61 years 247 days old (61.675 years) 
with a standard deviation of 10.17 years. The mean age of 30 randomly selected 
Democratic senators was 61 years 257 days old (61.704 years) with a standard 
deviation of 9.55 years. 

Exercise: 


Problem: 


Do the data indicate that Democratic senators are older than Republican 
senators, on average? Test at a 5% level of significance. 


Solution: 


This is a test of two independent groups, two population means. The 
population standard deviations are unknown, but the sum of the sample sizes is 
30 + 30 = 60, which is greater than 30, so we can use the normal 
approximation to the Student’s t-distribution. Subscripts: 1: Democratic
senators; 2: Republican senators


Random variable: $\bar{X}_1 - \bar{X}_2$ = difference in the mean age of Democratic and
Republican U.S. senators.


$H_0: \mu_1 \leq \mu_2$, that is, $H_0: \mu_1 - \mu_2 \leq 0$


$H_a: \mu_1 > \mu_2$, that is, $H_a: \mu_1 - \mu_2 > 0$


The words "older than" translates as a “>” symbol and goes into H,. Therefore, 
this is a right-tailed test. 


[Figure: distribution of $\bar{X}_1 - \bar{X}_2$ for the right-tailed test]


Make a decision: The p-value is larger than 5%, so we cannot reject the 
null hypothesis. Calculating the test statistic shows that it does not fall in 
the tail, so again we cannot reject the null hypothesis. Both methods of making 
this statistical decision lead to the same conclusion. 


Conclusion: At the 5% level of significance, from the sample data, there is not 
sufficient evidence to conclude that the mean age of Democratic senators is 
greater than the mean age of the Republican senators. 
Chapter Review 


A hypothesis test of two population means from independent samples where the 
population standard deviations are known (typically approximated with the sample 
standard deviations) will have these characteristics: 


• Random variable: $\bar{X}_1 - \bar{X}_2$ = the difference of the means 
• Distribution: normal distribution 


Formula Review 
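Test Statistic (z-score): 


$Z_c = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_0}{\sqrt{\frac{(\sigma_1)^2}{n_1} + \frac{(\sigma_2)^2}{n_2}}}$ 


where: 
$\delta_0$ is the hypothesized difference $\mu_1 - \mu_2$ between the population means $\mu_1$ and $\mu_2$. 
$\sigma_1$ and $\sigma_2$ are the known population standard deviations. $n_1$ and $n_2$ are the sample 
sizes. $\bar{x}_1$ and $\bar{x}_2$ are the sample means. 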


Use the following information to answer the next five exercises. The mean speeds of 
fastball pitches from two different baseball pitchers are to be compared. A sample of 
14 fastball pitches is measured from each pitcher. The populations have normal 
distributions. [link] shows the result. Scouts believe that Rodriguez pitches a 
speedier fastball. 


Pitcher      Sample mean speed of pitches (mph)    Population standard deviation
Wesley       86                                    3
Rodriguez    91                                    7


Exercise: 


Problem: What is the random variable? 


Solution: 


The difference in mean speeds of the fastball pitches of the two pitchers 


Exercise: 


Problem: State the null and alternative hypotheses. 


Exercise: 


Problem: What is the test statistic? 


Solution: 


-2.46 


Exercise: 


Problem: At the 1% significance level, what is your conclusion? 


Solution: 


At the 1% significance level, we can reject the null hypothesis. There is 
sufficient evidence to conclude that the mean speed of Rodriguez's fastball is 
faster than Wesley's. 


Use the following information to answer the next five exercises. A researcher is 
testing the effects of plant food on plant growth. Nine plants have been given the 
plant food. Another nine plants have not been given the plant food. The heights of 
the plants are recorded after eight weeks. The populations have normal distributions. 
The following table is the result. The researcher thinks the food makes the plants 
grow taller. 


Plant group    Sample mean height of plants (inches)    Population standard deviation
Food           16                                       2.5
No food        14                                       1.5
Exercise: 


Problem: Is the population standard deviation known or unknown? 


Exercise: 
Problem: State the null and alternative hypotheses. 
Solution: 
Subscripts: 1 = Food, 2 = No Food 


$H_0: \mu_1 \le \mu_2$ 
$H_a: \mu_1 > \mu_2$ 


Exercise: 


Problem: At the 1% significance level, what is your conclusion? 


Use the following information to answer the next five exercises. Two metal alloys are 
being considered as material for ball bearings. The mean melting points of the two 
alloys are to be compared. Fifteen pieces of each metal are being tested. Both 
populations have normal distributions. The following table is the result. It is 
believed that Alloy Zeta has a different melting point. 


Alloy          Sample mean melting temperature (°F)    Population standard deviation
Alloy Gamma    800                                     95
Alloy Zeta     900                                     105
Exercise: 


Problem: State the null and alternative hypotheses. 
Solution: 


Subscripts: 1 = Gamma, 2 = Zeta 


$H_0: \mu_1 = \mu_2$ 
$H_a: \mu_1 \ne \mu_2$ 
Exercise: 


Problem: Is this a right-, left-, or two-tailed test? 


Exercise: 


Problem: At the 1% significance level, what is your conclusion? 


Solution: 


There is sufficient evidence so we cannot accept the null hypothesis. The data 
support that the melting point for Alloy Zeta is different from the melting point 
of Alloy Gamma. 


Homework 


Note: 


If you are using a Student's t-distribution for one of the following homework 
problems, including for paired data, you may assume that the underlying population 
is normally distributed. (When using these tests in a real situation, you must first 
prove that assumption, however.) 


Exercise: 


Problem: 


A study is done to determine if students in the California state university system 
take longer to graduate, on average, than students enrolled in private 
universities. One hundred students from both the California state university 
system and private universities are surveyed. Suppose that from years of 
research, it is known that the population standard deviations are 1.5811 years 
and 1 year, respectively. The following data are collected. The California state 
university system students took on average 4.5 years with a standard deviation 
of 0.8. The private university students took on average 4.1 years with a standard 
deviation of 0.3. 


Exercise: 


Problem: 


Parents of teenage boys often complain that auto insurance costs more, on 
average, for teenage boys than for teenage girls. A group of concerned parents 
examines a random sample of insurance bills. The mean annual cost for 36 
teenage boys was $679. For 23 teenage girls, it was $559. From past years, it is 
known that the population standard deviation for each group is $180. Determine 
whether or not you believe that the mean cost for auto insurance for teenage 
boys is greater than that for teenage girls. 


Solution: 
Subscripts: 1 = boys, 2 = girls 


a. $H_0: \mu_1 \le \mu_2$ 

b. $H_a: \mu_1 > \mu_2$ 

c. The random variable is the difference in the mean auto insurance costs for 
boys and girls. 

d. normal 

e. test statistic: z = 2.50 

f. p-value: 0.0062 

g. Check student’s solution. 


h. i. Alpha: 0.05 
ii. Decision: Cannot accept the null hypothesis. 
iii. Reason for Decision: p-value < alpha 
iv. Conclusion: At the 5% significance level, there is sufficient evidence 
to conclude that the mean cost of auto insurance for teenage boys is 
greater than that for girls. 
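
As a rough check on parts e and f, the z-score and the right-tailed p-value can be recomputed from the numbers given in the problem. The sketch below is illustrative only: the function name two_mean_z_test and its parameter names are not from the text, and the p-value is taken from the standard normal distribution using the complementary error function in Python's math module.

```python
import math


def two_mean_z_test(mean1, mean2, sigma1, sigma2, n1, n2, delta0=0.0):
    """Return (z, right-tailed p-value) for two independent means
    with known population standard deviations.

    delta0 is the hypothesized difference mu1 - mu2 (0 in this problem).
    """
    # Standard error of the difference between the two sample means.
    standard_error = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    z = ((mean1 - mean2) - delta0) / standard_error
    # P(Z > z) for a standard normal Z.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Group 1 = boys (n = 36, mean $679), group 2 = girls (n = 23, mean $559),
# population standard deviation $180 for each group.
z, p_value = two_mean_z_test(679, 559, 180, 180, 36, 23)
print(f"z = {z:.2f}")
print(f"right-tailed p-value = {p_value:.4f}")
print("p-value < alpha (0.05):", p_value < 0.05)
```

Because the script does not round z before computing the p-value, its last digit may differ slightly from a value looked up in a table with z rounded to two decimals.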


Exercise: 


Problem: 


A group of transfer bound students wondered if they will spend the same mean 
amount on texts and supplies each year at their four-year university as they 
have at their community college. They conducted a random survey of 54 
students at their community college and 66 students at their local four-year 
university. The sample means were $947 and $1,011, respectively. The 
population standard deviations are known to be $254 and $87, respectively. 
Conduct a hypothesis test to determine if the means are statistically the same. 


Exercise: 


Problem: 


Some manufacturers claim that non-hybrid sedan cars have a lower mean miles- 
per-gallon (mpg) than hybrid ones. Suppose that consumers test 21 hybrid 
sedans and get a mean of 31 mpg with a standard deviation of seven mpg. 
Thirty-one non-hybrid sedans get a mean of 22 mpg with a standard deviation 
of four mpg. Suppose that the population standard deviations are known to be 
six and three, respectively. Conduct a hypothesis test to evaluate the 
manufacturers' claim. 


Solution: 


Subscripts: 1 = non-hybrid sedans, 2 = hybrid sedans 


a. $H_0: \mu_1 \ge \mu_2$ 

b. $H_a: \mu_1 < \mu_2$ 

c. The random variable is the difference in the mean miles per gallon of 
non-hybrid sedans and hybrid sedans. 

d. normal 

e. test statistic: z = -6.36 

f. p-value: 0 

g. Check student’s solution. 


h. i. Alpha: 0.05 
ii. Decision: Cannot accept the null hypothesis. 
iii. Reason for decision: p-value < alpha 
iv. Conclusion: At the 5% significance level, there is sufficient evidence 
to conclude that the mean miles per gallon of non-hybrid sedans is 
less than that of hybrid sedans. 


Exercise: 


Problem: 


A baseball fan wanted to know if there is a difference between the number of 
games played in a World Series when the American League won the series 
versus when the National League won the series. From 1922 to 2012, the 
population standard deviation of games won by the American League was 1.14, 
and the population standard deviation of games won by the National League 
was 1.11. Of 19 randomly selected World Series games won by the American 
League, the mean number of games won was 5.76. The mean number of 17 
randomly selected games won by the National League was 5.42. Conduct a 
hypothesis test. 


Exercise: 


Problem: 


One of the questions in a study of marital satisfaction of dual-career couples 
was to rate the statement “I’m pleased with the way we divide the 
responsibilities for childcare.” The ratings went from one (strongly agree) to 
five (strongly disagree). [link] contains ten of the paired responses for husbands 
and wives. Conduct a hypothesis test to see if the mean difference in the 
husband’s versus the wife’s satisfaction level is negative (meaning that, within 
the partnership, the husband is happier than the wife). 


Wife’s 
score 


Husband’s 
score 


Solution: 


a. $H_0: \mu_d = 0$ 

b. $H_a: \mu_d < 0$ 

c. The random variable $\bar{X}_d$ is the average difference between husband's and 
wife's satisfaction level. 

d. $t_9$ 

e. test statistic: t = -1.86 

f. p-value: 0.0479 

g. Check student’s solution 


h. i. Alpha: 0.05 
ii. Decision: Cannot accept the null hypothesis, but run another test. 
iii. Reason for Decision: p-value < alpha 
iv. Conclusion: This is a weak test because alpha and the p-value are 
close. At the 5% significance level, there is sufficient, though 
marginal, evidence to conclude that the mean difference is negative. 


Matched or Paired Samples 


In most cases of economic or business data we have little or no control over how 
the data are gathered. In this sense the data are not the result of a planned, 
controlled experiment. In some cases, however, we can develop data that are part of a 
controlled experiment. This occurs frequently in quality control. Imagine that two 
machines built to the same design, but at different manufacturing plants, are being 
tested for differences in some production metric, such as speed of output, or in 
meeting some production specification, such as strength of the product. The test has 
the same format as the tests we have been conducting, but here we can have matched 
pairs for which we can test if differences exist. Each observation has its matched 
pair against which differences are calculated. First, the differences in the metric 
between the two lists of observations are calculated; each difference is typically 
labeled with the letter "d." Then the average of these matched differences, 
$\bar{X}_d$, is calculated, as is its standard deviation, $s_d$. We expect the standard 
deviation of the differences of matched pairs to be smaller than that of unmatched 
pairs, presumably because the correlation between the two groups removes variation 
that is common to both members of a pair. 


When using a hypothesis test for matched or paired samples, the following 
characteristics may be present: 


1. Simple random sampling is used. 

2. Sample sizes are often small. 

3. Two measurements (samples) are drawn from the same pair of individuals or 
objects. 

4. Differences are calculated from the matched or paired samples. 

5. The differences form the sample that is used for the hypothesis test. 

6. Either the matched pairs have differences that come from a population that is 
normal or the number of differences is sufficiently large so that distribution of the 
sample mean of differences is approximately normal. 


In a hypothesis test for matched or paired samples, subjects are matched in pairs and 
differences are calculated. The differences are the data. The population mean for the 
differences, $\mu_d$, is then tested using a Student's-t test for a single population 
mean with n - 1 degrees of freedom, where n is the number of differences, that is, the 
number of pairs, not the number of observations. 

The null and alternative hypotheses for this test are: 

Equation: 


$H_0: \mu_d = 0$ 


Equation: 


$H_a: \mu_d \ne 0$ 


The test statistic is: 

Equation: 


$t_c = \frac{\bar{x}_d - \mu_d}{s_d / \sqrt{n}}$ 

Example: 
Exercise: 

Problem: 


A company has developed a training program for its entering employees because 
it has become concerned with the results of the six-month employee review. 
The company hopes that the training program can result in better six-month reviews. 
Each trainee constitutes a "pair": the score the employee received when first 
entering the firm and the score given at the six-month review. The difference in 
the two scores was calculated for each employee, and the means before and 
after the training program were calculated. The sample mean before the training 
program was 20.4 and the sample mean after the training program was 23.9. The 
standard deviation of the differences in the two scores across the 20 employees 
was 3.8 points. Test at the 10% significance level the null hypothesis that the two 
population means are equal against the alternative that the training program helps 
improve the employees' scores. 


Solution: 


The first step is to identify this as a two-sample case: before the training and after 
the training. This distinguishes the problem from a simple one-sample test. 
Second, we determine that the two samples are "paired": each observation in the 
first sample has a paired observation in the second sample. This information tells 
us that the null and alternative hypotheses should be: 

Equation: 


$H_0: \mu_d \le 0$ 

Equation: 


$H_a: \mu_d > 0$ 


This form reflects the implied claim that the training course improves scores; the 
test is one-tailed and the claim is in the alternative hypothesis. Because the 
experiment was conducted as a matched-pairs sample, rather than simply taking 
scores from people who took the training course and those who didn't, we use the 
matched-pair test statistic: 

Equation: 


Test statistic: $t_c = \frac{\bar{x}_d - \mu_d}{s_d / \sqrt{n}} = \frac{(23.9 - 20.4) - 0}{3.8 / \sqrt{20}} = 4.12$ 


In order to evaluate this expression, the individual pre-training and post-training 
scores are used to calculate the individual differences. These differences are 
then averaged to give the mean difference: 

Equation: 


$\bar{x}_d = \frac{\sum d_i}{n}$ 


From these differences we can calculate the standard deviation across the 
individual differences: 

Equation: 


$s_d = \sqrt{\frac{\sum (d_i - \bar{x}_d)^2}{n - 1}}$ where $d_i = x_{1i} - x_{2i}$ 


We can now compare the calculated value of the test statistic, 4.12, with the 
critical value. The critical value is a Student's t with degrees of freedom equal to 
the number of pairs, not observations, minus 1. In this case there are 20 pairs, so 
df = 20 - 1 = 19. For this one-tailed test at the 10% significance level the critical 
value is $t_{0.10} = 1.328$; the two-tailed value at the 90% confidence level is 
$t_{\alpha/2} = \pm 1.729$. The calculated test statistic exceeds both values and is 
most certainly in the tail of the distribution, and thus we cannot accept the null 
hypothesis that there is no difference from the training program. The evidence seems 
to indicate that the training aids employees in gaining higher scores. 
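
The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch, assuming that only the summary statistics quoted in the problem are available; the function name paired_t_from_summary and its parameter names are hypothetical. It relies on the fact that the mean of the paired differences equals the difference of the two sample means.

```python
import math


def paired_t_from_summary(mean_after, mean_before, sd_diff, n_pairs, mu_d0=0.0):
    """Return (t, degrees of freedom) for a matched-pairs test from summary data.

    sd_diff is the standard deviation of the paired differences,
    and mu_d0 is the hypothesized mean difference under H0.
    """
    # The mean of the differences equals the difference of the means.
    mean_diff = mean_after - mean_before
    standard_error = sd_diff / math.sqrt(n_pairs)
    t_stat = (mean_diff - mu_d0) / standard_error
    # Degrees of freedom use the number of pairs, not observations.
    return t_stat, n_pairs - 1


# Training example: mean 20.4 before, 23.9 after, s_d = 3.8, 20 employees.
t_stat, df = paired_t_from_summary(23.9, 20.4, 3.8, 20)
print(f"t = {t_stat:.2f} with {df} degrees of freedom")
```

The critical value is not computed here; it would still be read from a Student's t table for the reported degrees of freedom.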


Example: 
Exercise: 


Problem: 


A study was conducted to investigate the effectiveness of hypnotism in reducing 
pain. Results for randomly selected subjects are shown in [link]. A lower score 
indicates less pain. The "before" value is matched to an "after" value and the 
differences are calculated. Are the sensory measurements, on average, lower after 
hypnotism? Test at a 5% significance level. 


Subject:    A      B      C      D       E       F      G      H
Before      6.6    6.5    9.0    10.3    11.3    8.1    6.3    11.6
After       6.8    2.4    7.4    8.5     8.1     6.1    3.4    2.0
Solution: 


Corresponding "before" and "after" values form matched pairs. (Calculate "after" 
minus "before.") 


After data    Before data    Difference
6.8           6.6            0.2
2.4           6.5            -4.1
7.4           9              -1.6
8.5           10.3           -1.8
8.1           11.3           -3.2
6.1           8.1            -2
3.4           6.3            -2.9
2             11.6           -9.6


The data for the test are the differences: {0.2, -4.1, -1.6, -1.8, -3.2, -2, -2.9, -9.6} 


The sample mean and sample standard deviation of the differences are: 
$\bar{x}_d = -3.13$ and $s_d = 2.91$. Verify these values. 


Let $\mu_d$ be the population mean for the differences. We use the subscript d to 
denote "differences." 


Random variable: $\bar{X}_d$ = the mean difference of the sensory measurements 

$H_0: \mu_d \ge 0$ 


The null hypothesis is zero or positive, meaning that there is the same or more 
pain felt after hypnotism. That means the subject shows no improvement. ($\mu_d$ is the 
population mean of the differences.) 


$H_a: \mu_d < 0$ 


The alternative hypothesis is negative, meaning there is less pain felt after 
hypnotism. That means the subject shows improvement. The score should be 
lower after hypnotism, so the difference ought to be negative to indicate 
improvement. 


Distribution for the test: The distribution is a Student's t with df = n - 1 = 8 - 1 = 
7. Use $t_7$. (Notice that the test is for a single population mean.) 


Calculate the test statistic and look up the critical value using the Student's-t 
distribution: The calculated value of the test statistic is -3.04 and the critical 
value of the t distribution with 7 degrees of freedom at the 5% level of significance 
is -1.895 for this left-tailed test. 


[Figure: Student's-t distribution with 7 degrees of freedom, with the calculated test 
statistic and the critical value marked in the left tail.] 

Compare the critical value for alpha against the calculated test statistic. 


Comparing the calculated test statistic with the critical value gives us the 
result. In this question the calculated test statistic is -3.04 and the critical 
value is -1.895. The test statistic is clearly in the tail, and thus we cannot 
accept the null hypothesis that there is no difference between the two 
situations, hypnotized and not hypnotized. 


Make a decision: Cannot accept the null hypothesis, $H_0$. This means that $\mu_d < 0$ 
and there is a statistically significant improvement. 


Conclusion: At a 5% level of significance, from the sample data, there is 
sufficient evidence to conclude that the sensory measurements, on average, are 
lower after hypnotism. Hypnotism appears to be effective in reducing pain. 
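
The steps of this example can be reproduced from the raw data. The sketch below uses only Python's standard library; the variable names are illustrative, and the critical value -1.895 is taken from the t table as in the text rather than computed.

```python
import math
import statistics

# Sensory measurements for subjects A through H.
before = [6.6, 6.5, 9.0, 10.3, 11.3, 8.1, 6.3, 11.6]
after = [6.8, 2.4, 7.4, 8.5, 8.1, 6.1, 3.4, 2.0]

# The differences ("after" minus "before") are the data for the test.
differences = [a - b for a, b in zip(after, before)]

n = len(differences)
mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)  # sample standard deviation (n - 1)

# Single-mean t statistic on the differences, with H0: mu_d >= 0.
t_stat = (mean_diff - 0) / (sd_diff / math.sqrt(n))

# Left-tailed critical value for df = 7 at the 5% level, read from a t table.
critical_value = -1.895
in_left_tail = t_stat < critical_value

print(f"mean difference = {mean_diff:.2f}, s_d = {sd_diff:.2f}")
print(f"t = {t_stat:.2f} with {n - 1} degrees of freedom")
print("test statistic in the left tail:", in_left_tail)
```

Working from the unrounded mean and standard deviation avoids carrying rounding error into the test statistic.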


Example: 

A college football coach was interested in whether the college's strength development 
class increased his players' maximum lift (in pounds) on the bench press exercise. He 
asked four of his players to participate in a study. The amount of weight they could 
each lift was recorded before they took the strength development class. After 
completing the class, the amount of weight they could each lift was again measured. 
The data are as follows: 


Weight (in pounds)                            Player 1    Player 2    Player 3    Player 4
Amount of weight lifted prior to the class    205         241         338         368
Amount of weight lifted after the class       295         252         330         360


The coach wants to know if the strength development class makes his players 
stronger, on average. 

Record the differences data. Calculate the differences by subtracting the amount of 
weight lifted prior to the class from the weight lifted after completing the class. The 
data for the differences are: {90, 11, -8, -8}. 

$\bar{x}_d = 21.3$, $s_d = 46.7$ 

Using the difference data, this becomes a test of a single mean. 

Define the random variable: $\bar{X}_d$ = mean difference in the maximum lift per player. 
The distribution for the hypothesis test is a Student's t with 3 degrees of freedom. 
$H_0: \mu_d \le 0$, $H_a: \mu_d > 0$ 


Calculate the test statistic and look up the critical value: The calculated value of 
the test statistic is 0.91. The critical value of the Student's t at the 5% level of 
significance and 3 degrees of freedom is 2.353. 

Decision: If the level of significance is 5%, we cannot reject the null hypothesis, 
because the calculated value of the test statistic is not in the tail. 

What is the conclusion? 

At a 5% level of significance, from the sample data, there is not sufficient evidence to 
conclude that the strength development class helped to make the players stronger, on 
average. 


Chapter Review 
A hypothesis test for matched or paired samples (t-test) has these characteristics: 


• Test the differences by subtracting one measurement from the other measurement 

• Random Variable: $\bar{x}_d$ = mean of the differences 

• Distribution: Student's-t distribution with n - 1 degrees of freedom 

• If the number of differences is small (less than 30), the differences must follow a 
normal distribution. 

• Two samples are drawn from the same set of objects. 

• Samples are dependent. 


Formula Review 


Test Statistic (t-score): $t_c = \frac{\bar{x}_d - \mu_d}{s_d / \sqrt{n}}$ 


where: 


$\bar{x}_d$ is the mean of the sample differences. $\mu_d$ is the mean of the population 
differences. $s_d$ is the sample standard deviation of the differences. n is the sample size. 


Use the following information to answer the next five exercises. A study was conducted 
to test the effectiveness of a software patch in reducing system failures over a six-month 
period. Results for randomly selected installations are shown in [link]. The "before" 
value is matched to an "after" value, and the differences are calculated. The differences 
have a normal distribution. Test at the 1% significance level. 


Installation    A    B    C    D    E    F    G    H
Before          3    6    4    2    5    8    2    6
After           1    5    2    0    1    0    2    2


Exercise: 
Problem: What is the random variable? 


Solution: 


the mean difference of the system failures 


Exercise: 


Problem: State the null and alternative hypotheses. 


Exercise: 
Problem: What conclusion can you draw about the software patch? 


Solution: 


With a p-value of 0.0067, we cannot accept the null hypothesis. There is enough 
evidence to support that the software patch is effective in reducing the number of 
system failures. 


Use the following information to answer the next five exercises. A study was conducted to 
test the effectiveness of a juggling class. Before the class started, six subjects juggled as 
many balls as they could at once. After the class, the same six subjects juggled as many 
balls as they could. The differences in the number of balls are calculated. The 
differences have a normal distribution. Test at the 1% significance level. 


Subject    A    B    C    D    E    F
Before     3    4    3    2    4    5
After      4    5    6    4    5    7
Exercise: 


Problem: State the null and alternative hypotheses. 


Exercise: 


Problem: What is the sample mean difference? 


Exercise: 


Problem: What conclusion can you draw about the juggling class? 


Use the following information to answer the next five exercises. A doctor wants to know 
if a blood pressure medication is effective. Six subjects have their blood pressures 
recorded. After twelve weeks on the medication, the same six subjects have their blood 
pressure recorded again. For this test, only systolic pressure is of concern. Test at the 
1% significance level. 


Patient A B C D E F 

Before 161 162 165 162 166 171 

After 158 159 166 160 167 169 
Exercise: 


Problem: State the null and alternative hypotheses. 
Solution: 
$H_0: \mu_d \ge 0$ 
$H_a: \mu_d < 0$ 


Exercise: 


Problem: What is the test statistic? 


Exercise: 


Problem: What is the sample mean difference? 


Exercise: 


Problem: What is the conclusion? 


Solution: 


We decline to reject the null hypothesis. There is not sufficient evidence to support 
that the medication is effective. 


Homework 


Exercise: 
Problem: 
Ten individuals went on a low-fat diet for 12 weeks to lower their cholesterol. The 
data are recorded in [link]. Do you think that their cholesterol levels were 
significantly lowered? 


Starting cholesterol level    Ending cholesterol level
140                           140
220                           230
110                           120
240                           220
200                           190
180                           150
190                           200
360                           300
280                           300
260                           240
Solution: 


p-value = 0.1494 


At the 5% significance level, there is insufficient evidence to conclude that the 
low-fat diet lowered cholesterol levels after 12 weeks. 


Use the following information to answer the next two exercises. A new AIDS prevention 
drug was tried on a group of 224 HIV positive patients. Forty-five patients developed 
AIDS after four years. In a control group of 224 HIV positive patients, 68 developed 
AIDS after four years. We want to test whether the method of treatment reduces the 
proportion of patients that develop AIDS after four years or if the proportions of the 
treated group and the untreated group stay the same. 


Let the subscript t = treated patient and ut = untreated patient. 
Exercise: 


Problem: The appropriate hypotheses are: 


a. $H_0: p_t < p_{ut}$ and $H_a: p_t \ge p_{ut}$ 
b. $H_0: p_t \le p_{ut}$ and $H_a: p_t > p_{ut}$ 
c. $H_0: p_t = p_{ut}$ and $H_a: p_t \ne p_{ut}$ 
d. $H_0: p_t = p_{ut}$ and $H_a: p_t < p_{ut}$ 


Use the following information to answer the next two exercises. An experiment is 
conducted to show that blood pressure can be consciously reduced in people trained in a 
“biofeedback exercise program.” Six subjects were randomly selected and blood 
pressure measurements were recorded before and after the training. The difference 
between blood pressures was calculated (after - before) producing the following results: 
$\bar{x}_d = -10.2$, $s_d = 8.4$. Using the data, test the hypothesis that the blood pressure has 
decreased after the training. 

Exercise: 


Problem: The distribution for the test is: 


a. $t_5$ 

b. $t_6$ 

c. $N(-10.2, 8.4)$ 

d. $N\left(-10.2, \frac{8.4}{\sqrt{6}}\right)$ 


Exercise: 


Problem: 


A golf instructor is interested in determining if her new technique for improving 
players’ golf scores is effective. She takes four new students. She records their 18- 
hole scores before learning the technique and then after having taken her class. She 
conducts a hypothesis test. The data are as follows. 


                           Player 1    Player 2    Player 3    Player 4
Mean score before class    83          78          93          97
Mean score after class     80          80          86          86

The correct decision is: 


a. Reject $H_0$. 
b. Do not reject $H_0$. 


Exercise: 


Problem: 


A local cancer support group believes that the estimate for new female breast 
cancer cases in the south is higher in 2013 than in 2012. The group compared the 
estimates of new female breast cancer cases by southern state in 2012 and in 2013. 
The results are in [link]. 


Southern states 2012 2013 


Alabama 3,450 3,720 
Arkansas 2,150 2,280 
Florida 15,540 15,710 
Georgia 6,970 7,310 
Kentucky 3,160 3,300 
Louisiana 3,320 3,630 
Mississippi 1,990 2,080 
North Carolina 7,090 7,430 
Oklahoma 2,630 2,690 
South Carolina 3,570 3,580 
Tennessee 4,680 5,070 
Texas 15,050 14,980 
Virginia 6,190 6,280 
Solution: 


Test: two matched pairs or paired samples (t-test) 


Random variable: $\bar{X}_d$ 
Distribution: $t_{12}$ 
$H_0: \mu_d = 0$, $H_a: \mu_d > 0$ 


The mean of the differences of new female breast cancer cases in the south 
between 2013 and 2012 is greater than zero. The estimate for new female breast 
cancer cases in the south is higher in 2013 than in 2012. 


Graph: right-tailed 
p-value: 0.0004 
Decision: Cannot accept Ho 


Conclusion: At the 5% level of significance, from the sample data, there is 
sufficient evidence to conclude that there was a higher estimate of new female 
breast cancer cases in 2013 than in 2012. 


Exercise: 
Problem: 
A traveler wanted to know if the prices of hotels are different in the ten cities that 
he visits the most often. The list of the cities with the corresponding hotel prices 
for his two favorite hotel chains is in [link]. Test at the 1% level of significance. 


Cities            Hyatt Regency prices in dollars    Hilton prices in dollars
Atlanta           107                                169
Boston            358                                209
Chicago           209                                299
Dallas            209                                198
Denver            167                                169
Indianapolis      179                                214
Los Angeles       179                                169
New York City     625                                459
Philadelphia      179                                159
Washington, DC    245                                239
Exercise: 
Problem: 


A politician asked his staff to determine whether the underemployment rate in the 
northeast decreased from 2011 to 2012. The results are in [link]. 


Northeastern states    2011    2012
Connecticut            17.3    16.4
Delaware               17.4    13.7
Maine                  19.3    16.1
Maryland               16.0    15.5
Massachusetts          17.6    18.2
New Hampshire          15.4    13.5
New Jersey             19.2    18.7
New York               18.5    18.7
Ohio                   18.2    18.8
Pennsylvania           16.5    16.9
Rhode Island           20.7    22.4
Vermont                14.7    12.3
West Virginia          15.5    17.3
Solution: 


Test: matched or paired samples (t-test) 


Difference data: {-0.9, -3.7, -3.2, -0.5, 0.6, -1.9, -0.5, 0.2, 0.6, 0.4, 1.7, -2.4, 1.8} 


Random variable: $\bar{X}_d$ 
Distribution: $t_{12}$ 
$H_0: \mu_d = 0$, $H_a: \mu_d < 0$ 


The mean of the differences of the rate of underemployment in the northeastern 
states between 2012 and 2011 is less than zero. The underemployment rate went 
down from 2011 to 2012. 


Graph: left-tailed. 
Decision: Cannot reject Ho. 


Conclusion: At the 5% level of significance, from the sample data, there is not 
sufficient evidence to conclude that there was a decrease in the underemployment 
rates of the northeastern states from 2011 to 2012. 


Bringing It Together 


Use the following information to answer the next ten exercises. Indicate which of the 
following choices best identifies the hypothesis test. 


a. independent group means, population standard deviations and/or variances known 

b. independent group means, population standard deviations and/or variances 
unknown 

c. matched or paired samples 

d. single mean 

e. two proportions 

f. single proportion 


Exercise: 
Problem: 
A powder diet is tested on 49 people, and a liquid diet is tested on 36 different 
people. The population standard deviations are two pounds and three pounds, 
respectively. Of interest is whether the liquid diet yields a higher mean weight loss 
than the powder diet. 


Exercise: 
Problem: 
A new chocolate bar is taste-tested on consumers. Of interest is whether the 
proportion of children who like the new chocolate bar is greater than the proportion 
of adults who like it. 


Solution: 


e 
Exercise: 
Problem: 
The mean number of English courses taken in a two-year time period by male and 
female college students is believed to be about the same. An experiment is 
conducted and data are collected from nine males and 16 females. 


Exercise: 
Problem: 


A football league reported that the mean number of touchdowns per game was five. 
A study is done to determine if the mean number of touchdowns has decreased. 


Solution: 


d 
Exercise: 


Problem: 


A study is done to determine if students in the California state university system 
take longer to graduate than students enrolled in private universities. One hundred 
students from both the California state university system and private universities 
are surveyed. From years of research, it is known that the population standard 
deviations are 1.5811 years and one year, respectively. 


Exercise: 
Problem: 


According to a YWCA Rape Crisis Center newsletter, 75% of rape victims know 
their attackers. A study is done to verify this. 


Solution: 


f 
Exercise: 
Problem: 
According to a recent study, U.S. companies have a mean maternity-leave of six 
weeks. 
Exercise: 
Problem: 
A recent drug survey showed an increase in use of drugs and alcohol among local 
high school students as compared to the national percent. Suppose that a survey of 
100 local youths and 100 national youths is conducted to see if the proportion of 
drug and alcohol use is higher locally than nationally. 


Solution: 


e 
Exercise: 
Problem: 
A new SAT study course is tested on 12 individuals. Pre-course and post-course 
scores are recorded. Of interest is the mean increase in SAT scores. The following 
data are collected: 


Pre-course score    Post-course score
1                   300
960                 920
1010                1100
840                 880
1100                1070
1250                1320
860                 860
1330                1370
790                 770
990                 1040
1110                1200
740                 850


Exercise: 


Problem: 


University of Michigan researchers reported in the Journal of the National Cancer 
Institute that quitting smoking is especially beneficial for those under age 49. In 
this American Cancer Society study, the risk (probability) of dying of lung cancer 
was about the same as for those who had never smoked. 


Solution: 


f 


Exercise: 


Problem: 


Lesley E. Tan investigated the relationship between left-handedness vs. 
right-handedness and motor competence in preschool children. Random samples of 41 
left-handed preschool children and 41 right-handed preschool children were given 
several tests of motor skills to determine if there is evidence of a difference 
between the children based on this experiment. The experiment produced the 
means and standard deviations shown in the table below. Determine the appropriate 
test and best distribution to use for that test. 


                           Left-handed  Right-handed
Sample size                41           41
Sample mean                97.5         98.1
Sample standard deviation  17.5         19.2


a. Two independent means, normal distribution 

b. Two independent means, Student’s-t distribution 

c. Matched or paired samples, Student’s-t distribution 
d. Two population proportions, normal distribution 


Exercise: 


Problem: 


A golf instructor is interested in determining if her new technique for improving 
players’ golf scores is effective. She takes four (4) new students. She records their 
18-hole scores before learning the technique and then after having taken her class. 
She conducts a hypothesis test. The data are shown in the table below. 


                         Player 1  Player 2  Player 3  Player 4
Mean score before class  93        78
Mean score after class   80        80        86        86

This is: 


a. a test of two independent means. 
b. a test of two proportions. 
c. a test of a single mean. 
d. a test of a single proportion. 


Solution: 


a 


Statistical Tables 


F Distribution 


Table entry for p is the critical value F* with probability p lying to its right. 
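One way to sanity-check an entry in the first numerator column: with 1 degree of freedom in the numerator, the F critical value at right-tail probability p equals the square of the Student's t critical value at right-tail probability p/2 for the same denominator degrees of freedom. The sketch below assumes t values copied from the Student's t table in this appendix (the 0.025 column); the dictionary and function names are illustrative only.

```python
# Student's t critical values with right-tail area 0.025,
# keyed by degrees of freedom (copied from the t table in this appendix).
T_CRIT_025 = {15: 2.131, 16: 2.120, 18: 2.101, 20: 2.086}


def f_crit_from_t(df_denominator: int) -> float:
    """F critical value at p = .050 for 1 numerator df, using F = t squared."""
    t = T_CRIT_025[df_denominator]
    return round(t * t, 2)


for df in sorted(T_CRIT_025):
    print(df, f_crit_from_t(df))
```

The printed values can be compared with the p = .050 entries under numerator degrees of freedom 1 for denominators 15, 16, 18, and 20.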


Degrees of 
freedom 

in the 
denominator 


.050 


.010 


Degrees of freedom in the numerator 


39.86 


161.45 


647.79 


4052.2 


405284 


49.50 


199.50 


799.50 


4999.5 


500000 


9.00 


19.00 


39.00 


99.00 


999.00 


53.59 


215.71 


864.16 


5403.4 


540379 


9.16 


19.16 


39.17 


99.17 


999.17 


55.83 


224.58 


899.58 


5624.6 


562500 


9.24 


19.25 


39.25 


99.25 


999.25 


57.24 


230.16 


921.85 


5763.6 


576405 


9.29 


19.30 


39.30 


99.30 


999.30 


58.20 


233.99 


937.11 


5859.0 


585937 


58.91 


236.77 


948.22 


5928.4 


592873 


9.35 


19.35 


39.36 


99.36 


999.36 


F critical values 


.100 


050 


.010 


001 


Degrees of freedom in the numerator 


5.04 5.46 
10.13 9.55 
17.44 16.04 
34.12 30.82 


167.03 148.50 


4.54 4.32 
7.71 6.94 
12.22 10.65 
21.20 18.00 
74.14 61.25 
4.06 3.78 
6.61 5.79 
10.01 8.43 
16.26 13.27 
47.18 37.12 
3.78 3.46 
5.99 5.14 
8.81 7.26 
13.75 10.92 
35.51 27.00 
3.59 3.26 
5.59 4.74 
8.07 6.54 
12.25 9.55 
29.25 21.69 


5.39 


9.28 


15.44 


29.46 


141.11 


4.19 


6.59 


5.34 


5.31 


9.01 


14.88 


28.24 


134.58 


4.05 


6.26 


7.46 


16.21 


5.28 


8.94 


14.73 


27.91 


132.85 


5.27 


8.89 


14.62 


27.67 


131.58 


3.98 


6.09 


9.07 


14.98 


49.66 


3.37 


4.88 


6.85 


10.46 


28.16 


3.01 


4.21 


5.70 


Degrees of 
freedom 

in the 
denominator 


Degrees of freedom in the numerator 


10 


60.19 


241.88 


968.63 


6055.8 


605621 


12 


60.71 


243.91 


976.71 


6106.3 


610668 


9.41 


19.41 


39.41 


99.42 


999.42 


5.22 


8.74 


14.34 


27.05 


128.32 


3.90 


5.91 


8.75 


14.37 


47 Al 


3.27 


4.68 


6.52 


9.89 


15 


61.22 


245.95 


984.87 


6157.3 


615764 


20 


61.74 


248.01 


993.10 


6208.7 


620908 


9.44 


19.45 


39.45 


25 


62.05 


249.26 


998.08 


6239.8 


624017 


9.45 


19.46 


39.46 


99.46 


999.46 


5.17 


8.63 


14.12 


26.58 


125.84 


3.83 


5.77 


8.50 


13.91 


45.70 


3.19 


4.52 


6.27 


9.45 


30 


62.26 


250.10 


1001.4 


6260.6 


626099 


40 


62.53 


251.14 


1005.6 


6286.8 


628712 


9.47 


19.47 


39.47 


F critical values (continued) 


Degrees of 
freedom 

in the 
denominator 


.001 


.100 


050 


025 


.010 


001 


.100 


050 


025 


.010 


001 


Degrees of freedom in the numerator 


26.92 


2.94 


26.42 


2.90 


4.00 


5.37 


7.72 


17.99 


2.67 


3.57 


4.67 


6.47 


13.71 


25.91 


13.32 


25.39 


Degrees of freedom in the numerator 


10.56 


5.71 


8.02 


5.08 


6.99 


4.72 


6.42 


13.48 


4.48 


6.06 


25.08 


12.69 


4.32 


5.80 


4.20 


5.61 


4.10 


5.47 


10 


11 


12 


13 


14 


15 


Degrees of freedom in the numerator 


22.86 


18.64 


16.39 


2.92 


4.10 


5.46 


7.56 


13.90 


2.73 


3.71 


4.83 


6.55 


12.55 


2.66 


3.59 


4.63 


6.22 


11.56 


2.61 


3.49 


4.47 


5.95 


10.80 


2.56 


3.41 


4.35 


5.74 


10.21 


2.52 


3.34 


4.24 


5.56 


9.73 


2.49 


12.56 


2.61 


3.48 


4.47 


5.99 


11.28 


2.54 


3.36 


4.28 


5.67 


10.35 


2.48 


3.26 


4.12 


5.41 


9.63 


2.43 


3.18 


4.00 


5.21 


9.07 


2.39 


3.11 


3.89 


5.04 


8.62 


2.36 


10.37 


2.38 


3.07 


3.85 


5.06 


9.20 


2.30 


2.95 


3.66 


4.74 


8.35 


2.24 


2.85 


3.51 


4.50 


2.20 


3.39 


Degrees of freedom in the numerator 


.050 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 
.025 6.20 4.77 4.15 3.80 3.58 3.41 3.29 3.20 
.010 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 
.001 16.59 11.34 9.34 8.25 7.57 7.09 6.74 6.47 


F critical values (continued) 


Degrees of freedom in the numerator (columns); degrees of freedom in the denominator (df, rows) 

df  p        10     12     15     20     25     30     40     50
8   .100   2.54   2.50   2.46   2.42   2.40   2.38   2.36   2.35
    .050   3.35   3.28   3.22   3.15   3.11   3.08   3.04   3.02
    .025   4.30   4.20   4.10   4.00   3.94   3.89   3.84   3.81
    .010   5.81   5.67   5.52   5.36   5.26   5.20   5.12   5.07
    .001  11.54  11.19  10.84  10.48  10.26  10.11   9.92   9.80
9   .100   2.42   2.38   2.34   2.30   2.27   2.25   2.23   2.22
    .050   3.14   3.07   3.01   2.94   2.89   2.86   2.83   2.80
    .025   3.96   3.87   3.77   3.67   3.60   3.56   3.51   3.47
    .010   5.26   5.11   4.96   4.81   4.71   4.65   4.57   4.52
    .001   9.89   9.57   9.24   8.90   8.69   8.55   8.37   8.26
10  .100   2.32   2.28   2.24   2.20   2.17   2.16   2.13   2.12
    .050   2.98   2.91   2.85   2.77   2.73   2.70   2.66   2.64
    .025   3.72   3.62   3.52   3.42   3.35   3.31   3.26   3.22
    .010   4.85   4.71   4.56   4.41   4.31   4.25   4.17   4.12
    .001   8.75   8.45   8.13   7.80   7.60   7.47   7.30   7.19
11  .100   2.25   2.21   2.17   2.12   2.10   2.08   2.05   2.04
    .050   2.85   2.79   2.72   2.65   2.60   2.57   2.53   2.51
    .025   3.53   3.43   3.33   3.23   3.16   3.12   3.06   3.03
    .010   4.54   4.40   4.25   4.10   4.01   3.94   3.86   3.81
    .001   7.92   7.63   7.32   7.01   6.81   6.68   6.52   6.42
12  .100   2.19   2.15   2.10   2.06   2.03   2.01   1.99   1.97
    .050   2.75   2.69   2.62   2.54   2.50   2.47   2.43   2.40
    .025   3.37   3.28   3.18   3.07   3.01   2.96   2.91   2.87
    .010   4.30   4.16   4.01   3.86   3.76   3.70   3.62   3.57
    .001   7.29   7.00   6.71   6.40   6.22   6.09   5.93   5.83
13  .100   2.14   2.10   2.05   2.01   1.98   1.96   1.93   1.92
    .050   2.67   2.60   2.53   2.46   2.41   2.38   2.34   2.31
    .025   3.25   3.15   3.05   2.95   2.88   2.84   2.78   2.74
    .010   4.10   3.96   3.82   3.66   3.57   3.51   3.43   3.38
    .001   6.80   6.52   6.23   5.93   5.75   5.63   5.47   5.37
14  .100   2.10   2.05   2.01   1.96   1.93   1.91   1.89   1.87
    .050   2.60   2.53   2.46   2.39   2.34   2.31   2.27   2.24
    .025   3.15   3.05   2.95   2.84   2.78   2.73   2.67   2.64
    .010   3.94   3.80   3.66   3.51   3.41   3.35   3.27   3.22
    .001   6.40   6.13   5.85   5.56   5.38   5.25   5.10   5.00
15  .100   2.06   2.02   1.97   1.92   1.89   1.87   1.85   1.83
    .050   2.54   2.48   2.40   2.33   2.28   2.25   2.20   2.18
    .025   3.06   2.96   2.86   2.76   2.69   2.64   2.59   2.55
    .010   3.80   3.67   3.52   3.37   3.28   3.21   3.13   3.08
    .001   6.08   5.81   5.54   5.25   5.07   4.95   4.80   4.70


F critical values (continued) 


Degrees of freedom in the numerator (columns); degrees of freedom in the denominator (df, rows) 

df  p         1      2      3      4      5      6      7      8
16  .100   3.05   2.67   2.46   2.33   2.24   2.18   2.13   2.09
    .050   4.49   3.63   3.24   3.01   2.85   2.74   2.66   2.59
    .025   6.12   4.69   4.08   3.73   3.50   3.34   3.22   3.12
    .010   8.53   6.23   5.29   4.77   4.44   4.20   4.03   3.89
    .001  16.12  10.97   9.01   7.94   7.27   6.80   6.46   6.19
17  .100   3.03   2.64   2.44   2.31   2.22   2.15   2.10   2.06
    .050   4.45   3.59   3.20   2.96   2.81   2.70   2.61   2.55
    .025   6.04   4.62   4.01   3.66   3.44   3.28   3.16   3.06
    .010   8.40   6.11   5.19   4.67   4.34   4.10   3.93   3.79
    .001  15.72  10.66   8.73   7.68   7.02   6.56   6.22   5.96
18  .100   3.01   2.62   2.42   2.29   2.20   2.13   2.08   2.04
    .050   4.41   3.55   3.16   2.93   2.77   2.66   2.58   2.51
    .025   5.98   4.56   3.95   3.61   3.38   3.22   3.10   3.01
    .010   8.29   6.01   5.09   4.58   4.25   4.01   3.84   3.71
    .001  15.38  10.39   8.49   7.46   6.81   6.35   6.02   5.76
19  .100   2.99   2.61   2.40   2.27   2.18   2.11   2.06   2.02
    .050   4.38   3.52   3.13   2.90   2.74   2.63   2.54   2.48
    .025   5.92   4.51   3.90   3.56   3.33   3.17   3.05   2.96
    .010   8.18   5.93   5.01   4.50   4.17   3.94   3.77   3.63
    .001  15.08  10.16   8.28   7.27   6.62   6.18   5.85   5.59
20  .100   2.97   2.59   2.38   2.25   2.16   2.09   2.04   2.00
    .050   4.35   3.49   3.10   2.87   2.71   2.60   2.51   2.45
    .025   5.87   4.46   3.86   3.51   3.29   3.13   3.01   2.91
    .010   8.10   5.85   4.94   4.43   4.10   3.87   3.70   3.56


21 


22 


23 


F critical values (continued) 


Degrees of 
freedom 

in the 
denominator 


16 


Degrees of freedom in the numerator 


14.82 


2.96 


14.20 


9.95 


2.57 


3.47 


5.66 


9.47 


8.10 


2.36 


3.07 


3.82 


3.75 


4.76 


7.67 


7.10 


2.23 


2.84 


3.48 


4.37 


6.95 


2.22 


2.82 


3.44 


4.31 


6.81 


2.21 


2.80 


3.41 


4.26 


6.70 


Degrees of freedom in the numerator 


10 


2.49 


2.99 


12 


2.89 


3.55 


15 


20 


1.89 


2.28 


2.68 


3.26 


25 


3.94 


6.08 


30 


3.71 


5.65 


40 


1.81 


2.15 


2.51 


3.02 


3.54 


5.33 


50 


2.47 


2.97 


5.44 


1.98 


3.41 


5.09 


60 


2.93 


17 


18 


19 


20 


21 


22 


Degrees of freedom in the numerator 


4.95 


1.90 


5.55 


2.68 


4.70 


5.27 


4.99 


1.86 


2.23 


4.82 


1.83 


2.18 


2.55 


3.07 


2.79 


1.73 


23 


F critical values (continued) 


Degrees of 
freedom 

in the 
denominator 


24 


25 


26 


050 


025 


.010 


001 


.100 


.050 


025 


.010 


001 


Degrees of freedom in the numerator 


2.23 


2.60 


4.48 


2.15 


2.50 


2.98 


4.33 


2.07 


2.39 


2.83 


2.02 


2.32 


2.73 


3.89 


1.71 


2.00 


2.29 


2.69 


3.79 


Degrees of freedom in the numerator 


2.62 


3.90 


5.98 


2.09 


2.60 


3.13 


3.85 


5.89 


2.08 


1.94 


2.21 


2.54 


3.53 


3.46 


1.96 


4.71 


1.88 


27 


28 


29 


30 


40 


.050 


025 


Degrees of freedom in the numerator 


3.37 


4.27 


5.53 


9.12 


2.51 


3.35 


3.63 


2.74 


2.59 


3.10 


3.82 


5.80 


2.07 


2.57 


3.08 


3.78 


5.73 


2.06 


2.56 


3.06 


3.75 


5.66 


2.06 


2.55 


3.04 


3.73 


5.59 


2.05 


2.53 


3.03 


3.70 


5.53 


2.00 


2.45 


2.90 


2.32 


2.73 


2.65 


2.18 


2.53 


F critical values (continued) 


Degrees of 
freedom 

in the 
denominator 


24 


25 


26 


27 


.010 


001 


Degrees of freedom in the numerator 


4.31 


Degrees of freedom in the numerator 


10 


12 


2.47 


15 


20 


2.25 


25 


1.66 


2.18 


30 


40 


50 


60 


.010 


001 


28 025 


29 025 


30 025 


40 025 


.001 


F critical values (continued) 


Degrees of freedom in the numerator 


3.06 2.93 
4.41 4.17 
1.84 1.79 
2.19 2.12 
2.55 2.45 
3.03 2.90 
4.35 4.11 
1.83 1.78 
2.18 2.10 
2.53 2.43 
3.00 2.87 
4.29 4.05 
1.82 1.77 
2.16 2.09 
2.51 2.41 
2.98 2.84 
4.24 4.00 
1.76 jal 
2.08 2.00 
2.39 2.29 
2.80 2.66 
3.87 3.64 


2.78 


3.75 


1.66 


1.92 


2.63 


3.66 


1.69 


1.96 


2.23 


2.60 


3.60 


1.68 


1.94 


2.21 


2.57 


3.54 


1.67 


1.93 


2.20 


2.55 


3.49 


1.61 


1.84 


2.07 


2.37 


2.54 


1.78 


Degrees of freedom in the numerator 


1.94 


2.38 


3.23 


Degrees of 
freedom 

in the 
denominator 


50 


60 


100 


200 


1000 


Degrees of freedom in the numerator 


5.86 


2.11 


2.65 


2.92 


2.37 


3.34 


1.91 


2.31 


2.70 


3.21 


4.48 


1.88 


2.26 


2.63 


3.11 


4.29 


1.85 


2.22 


2.58 


3.04 


1.77 


2.41 


1.73 


2.69 


3.61 


1.70 


2.20 


2.53 


3.44 


1.66 


1.93 


F critical values (continued) 


Degrees of 
freedom 

in the 
denominator 


50 


60 


100 


200 


001 


Degrees of freedom in the numerator 


10.89 


6.96 


5.46 


4.65 


4.14 


Degrees of freedom in the numerator 


10 


2.41 


12 


15 


20 


1.62 


1.97 


25 


3.78 


30 


3.51 


40 


3.30 


50 


60 


1000 


001 


.100 


.050 


025 


.010 


001 


F critical values (continued) 


Numerical entries represent the probability that a standard normal random variable is between 0 and z, where $z = \frac{x - \mu}{\sigma}$. 
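The table entries described above can be reproduced numerically. The following is a minimal sketch using only the Python standard library; the function names and the example values (x = 115, mean 100, standard deviation 15) are illustrative assumptions, not taken from the text.

```python
import math


def area_between_zero_and(z: float) -> float:
    """P(0 < Z < z) for a standard normal Z, computed with the error function."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2.0))


def z_score(x: float, mu: float, sigma: float) -> float:
    """Standardize x using z = (x - mu) / sigma."""
    return (x - mu) / sigma


# Area between 0 and z = 1.00, rounded to four places like the table.
print(round(area_between_zero_and(1.00), 4))

# Hypothetical example: x = 115 from a normal population with mean 100, sd 15.
print(round(area_between_zero_and(z_score(115, 100, 15)), 4))
```

To get a left-tail probability for a positive z, add 0.5 to the table area; for a right-tail probability, subtract the area from 0.5.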


Degrees of freedom in the numerator 


2.99 


Area 


2.90 


2.20 


2.77 


2.67 


2.42 


1.43 


1.90 


2.30 


Standard Normal Probability Distribution: Z Table 


2.26 


2.14 


2.15 2.00 
1.35 1.30 
1.47 1.41 
1.58 1.50 
1.72 1.61 
2.02 1.87 


z     0.07
0.0   0.0279
0.1   0.0675
0.2   0.1064
0.3   0.1443
0.4   0.1808
0.5   0.2157
0.6   0.2486
0.7   0.2794
0.8   0.3078


z     0.00    0.01
0.9   0.3159  0.3186
1.0   0.3413  0.3438
1.1   0.3643  0.3665
1.2   0.3849  0.3869
1.3   0.4032  0.4049
1.4   0.4192  0.4207
1.5   0.4332  0.4345
1.6   0.4452  0.4463
1.7   0.4554  0.4564
1.8   0.4641  0.4649
1.9   0.4713  0.4719
2.0   0.4772  0.4778
2.1   0.4821  0.4826
2.2   0.4861  0.4864
2.3   0.4893  0.4896
2.4   0.4918  0.4920
2.5   0.4938  0.4940
2.6   0.4953  0.4955
2.7   0.4965  0.4966
2.8   0.4974  0.4975
2.9   0.4981  0.4982
3.0   0.4987  0.4987
3.1   0.4990  0.4991
3.2   0.4993  0.4993
3.3   0.4995  0.4995
3.4   0.4997  0.4997


Standard Normal Distribution 


Student's t Distribution 


Upper critical values of Student's t Distribution with ν Degrees of Freedom 


For selected probabilities, α, the table shows the values $t_{\nu,\alpha}$ such that $P(t_\nu > t_{\nu,\alpha}) = \alpha$, where $t_\nu$ is a Student's t 
random variable with ν degrees of freedom. For example, the probability is .10 that a Student's t random variable 
with 10 degrees of freedom exceeds 1.372. 
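The worked example above (probability .10 of exceeding 1.372 with 10 degrees of freedom) can be checked by integrating the Student's t density numerically. This is a rough sketch using Simpson's rule and the standard library only; the function names and the step count are assumptions made for illustration.

```python
import math


def t_pdf(t: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def upper_tail(t: float, df: int, steps: int = 2000) -> float:
    """P(T > t) for t >= 0: one half minus the Simpson's-rule area from 0 to t."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


print(round(upper_tail(1.372, 10), 3))
```

Calling `upper_tail` with any other critical value and its degrees of freedom should return a probability close to that value's column heading.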


Probability of Exceeding the Critical Value 

ν 0.10 0.05 0.025 0.01 0.005 0.001 
1 3.078 6.314 12.706 31.821 63.657 318.313 
2 1.886 2.920 4.303 6.965 9.925 22.327 
3 1.638 2.353 3.182 4.541 5.841 10.215 
4 1.533 2.132 2.776 3.747 4.604 7.173 
5 1.476 2.015 2.571 3.365 4.032 5.893 
6 1.440 1.943 2.447 3.143 3.707 5.208 
7 1.415 1.895 2.365 2.998 3.499 4.782 
8 1.397 1.860 2.306 2.896 3.355 4.499 
9 1.383 1.833 2.262 2.821 3.250 4.296 
10 1.372 1.812 2.228 2.764 3.169 4.143 
11 1.363 1.796 2.201 2.718 3.106 4.024 
12 1.356 1.782 2.179 2.681 3.055 3.929 
13 1.350 1.771 2.160 2.650 3.012 3.852 
14 1.345 1.761 2.145 2.624 2.977 3.787 
15 1.341 1.753 2.131 2.602 2.947 3.733 
16 1.337 1.746 2.120 2.583 2.921 3.686 
17 1.333 1.740 2.110 2.567 2.898 3.646 
18 1.330 1.734 2.101 2.552 2.878 3.610 
19 1.328 1.729 2.093 2.539 2.861 3.579 
20 1.325 1.725 2.086 2.528 2.845 3.552 
21 1.323 1.721 2.080 2.518 2.831 3.527 
22 1.321 1.717 2.074 2.508 2.819 3.505 
23 1.319 1.714 2.069 2.500 2.807 3.485 
24 1.318 1.711 2.064 2.492 2.797 3.467 
25 1.316 1.708 2.060 2.485 2.787 3.450 
26 1.315 1.706 2.056 2.479 2.779 3.435 
27 1.314 1.703 2.052 2.473 2.771 3.421 
28 1.313 1.701 2.048 2.467 2.763 3.408 
29 1.311 1.699 2.045 2.462 2.756 3.396 
30 1.310 1.697 2.042 2.457 2.750 3.385 
40 1.303 1.684 2.021 2.423 2.704 3.307 
60 1.296 1.671 2.000 2.390 2.660 3.232 
100 1.290 1.660 1.984 2.364 2.626 3.174 
∞ 1.282 1.645 1.960 2.326 2.576 3.090 


NIST/SEMATECH e-Handbook of Statistical Methods, 
http://www.itl.nist.gov/div898/handbook/, September 2011. 


χ² Probability Distribution 


df 0.995 0.990 0.975 0.950 0.900 0.100 0.050 0.025 
1 0.000 0.000 0.001 0.004 0.016 2.706 3.841 5.024 
2 0.010 0.020 0.051 0.103 0.211 4.605 5.991 7.378 
3 0.072 0.115 0.216 0.352 0.584 6.251 7.815 9.348 
4 0.207 0.297 0.484 0.711 1.064 7.779 9.488 11.143 
5 0.412 0.554 0.831 1.145 1.610 9.236 11.070 12.833 
6 0.676 0.872 1.237 1.635 2.204 10.645 12.592 14.449 
7 0.989 1.239 1.690 2.167 2.833 12.017 14.067 16.013 
8 1.344 1.646 2.180 2.733 3.490 13.362 15.507 17.535 
9 1.735 2.088 2.700 3.325 4.168 14.684 16.919 19.023 
10 2.156 2.558 3.247 3.940 4.865 15.987 18.307 20.483 
11 2.603 3.053 3.816 4.575 5.578 17.275 19.675 21.920 
12 3.074 3.571 4.404 5.226 6.304 18.549 21.026 23.337 
13 3.565 4.107 5.009 5.892 7.042 19.812 22.362 24.736 
14 4.075 4.660 5.629 6.571 7.790 21.064 23.685 26.119 
15 4.601 5.229 6.262 7.261 8.547 22.307 24.996 27.488 
16 5.142 5.812 6.908 7.962 9.312 23.542 26.296 28.845 
17 5.697 6.408 7.564 8.672 10.085 24.769 27.587 30.191 
18 6.265 7.015 8.231 9.390 10.865 25.989 28.869 31.526 
19 6.844 7.633 8.907 10.117 11.651 27.204 30.144 32.852 
20 7.434 8.260 9.591 10.851 12.443 28.412 31.410 34.170 
21 8.034 8.897 10.283 11.591 13.240 29.615 32.671 35.479 
22 8.643 9.542 10.982 12.338 14.041 30.813 33.924 36.781 
23 9.260 10.196 11.689 13.091 14.848 32.007 35.172 38.076 
24 9.886 10.856 12.401 13.848 15.659 33.196 36.415 39.364 
25 10.520 11.524 13.120 14.611 16.473 34.382 37.652 40.646 
26 11.160 12.198 13.844 15.379 17.292 35.563 38.885 41.923 
27 11.808 12.879 14.573 16.151 18.114 36.741 40.113 43.195 


Area to the Right of the Critical Value of χ² 

df 0.995 0.990 0.975 0.950 0.900 0.100 0.050 0.025 
28 12.461 13.565 15.308 16.928 18.939 37.916 41.337 44.461 
29 13.121 14.256 16.047 17.708 19.768 39.087 42.557 45.722 
30 13.787 14.953 16.791 18.493 20.599 40.256 43.773 46.979 
40 20.707 22.164 24.433 26.509 29.051 51.805 55.758 59.342 
50 27.991 29.707 32.357 34.764 37.689 63.167 67.505 71.420 
60 35.534 37.485 40.482 43.188 46.459 74.397 79.082 83.298 
70 43.275 45.442 48.758 51.739 55.329 85.527 90.531 95.023 
80 51.172 53.540 57.153 60.391 64.278 96.578 101.879 106.629 
90 59.196 61.754 65.647 69.126 73.291 107.565 113.145 118.136 
100 67.328 70.065 74.222 77.929 82.358 118.498 124.342 129.561 
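Each chi-square table entry is a critical value whose right-tail area equals the column heading. A rough numerical check is sketched below, integrating the chi-square density with Simpson's rule using only the standard library. The function names, the step count, and the cutoff used for the upper integration limit are assumptions chosen for illustration.

```python
import math


def chi2_pdf(x: float, df: int) -> float:
    """Chi-square density, evaluated on the log scale to avoid overflow."""
    k = df / 2
    return math.exp((k - 1) * math.log(x) - x / 2 - k * math.log(2) - math.lgamma(k))


def area_to_right(x: float, df: int, steps: int = 20000) -> float:
    """Approximate P(chi-square > x) for x > 0 with Simpson's rule."""
    # Integrate far enough out that the remaining tail is negligible.
    upper = x + 50 + 20 * math.sqrt(2 * df)
    h = (upper - x) / steps  # steps must be even for Simpson's rule
    total = chi2_pdf(x, df) + chi2_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * chi2_pdf(x + i * h, df)
    return total * h / 3


print(round(area_to_right(18.307, 10), 3))
```

For example, the entry 18.307 in the row for 10 degrees of freedom sits under the 0.050 column, so the function should return a value close to 0.05 for those inputs.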


Mathematical Phrases, Symbols, and Formulas 


English Phrases Written Mathematically 


When the English says: | Interpret this as: 
X is at least 4. | X ≥ 4 
The minimum of X is 4. | X ≥ 4 
X is no less than 4. | X ≥ 4 
X is greater than or equal to 4. | X ≥ 4 
X is at most 4. | X ≤ 4 
The maximum of X is 4. | X ≤ 4 
X is no more than 4. | X ≤ 4 
X is less than or equal to 4. | X ≤ 4 
X does not exceed 4. | X ≤ 4 
X is greater than 4. | X > 4 
X is more than 4. | X > 4 
X exceeds 4. | X > 4 
X is less than 4. | X < 4 
There are fewer X than 4. | X < 4 
X is 4. | X = 4 
X is equal to 4. | X = 4 
X is the same as 4. | X = 4 
X is not 4. | X ≠ 4 
X is not equal to 4. | X ≠ 4 
X is not the same as 4. | X ≠ 4 
X is different than 4. | X ≠ 4 


Symbols and Their Meanings 


Chapter (1st used) | Symbol | Spoken | Meaning 
Sampling and Data | √ | The square root of | same 
Sampling and Data | π | Pi | 3.14159... (a specific number) 
Descriptive Statistics | Q1 | Quartile one | the first quartile 
Descriptive Statistics | Q2 | Quartile two | the second quartile 
Descriptive Statistics | Q3 | Quartile three | the third quartile 
Descriptive Statistics | IQR | interquartile range | Q3 - Q1 = IQR 
Descriptive Statistics | x̄ | x-bar | sample mean 
Descriptive Statistics | μ | mu | population mean 
Descriptive Statistics | s | s | sample standard deviation 
Descriptive Statistics | s² | s squared | sample variance 
Descriptive Statistics | σ | sigma | population standard deviation 
Descriptive Statistics | σ² | sigma squared | population variance 
Descriptive Statistics | Σ | capital sigma | sum 
Probability Topics | {} | brackets | set notation 


Chapter (1st used) | Symbol | Spoken | Meaning 
Probability Topics | S | S | sample space 
Probability Topics | A | Event A | event A 
Probability Topics | P(A) | probability of A | probability of A occurring 
Probability Topics | P(A|B) | probability of A given B | prob. of A occurring given B has occurred 
Probability Topics | P(A ∪ B) | prob. of A or B | prob. of A or B or both occurring 
Probability Topics | P(A ∩ B) | prob. of A and B | prob. of both A and B occurring (same time) 
Probability Topics | A′ | A-prime, complement of A | complement of A, not A 
Probability Topics | P(A′) | prob. of complement of A | same 
Probability Topics | G1 | green on first pick | same 
Probability Topics | P(G1) | prob. of green on first pick | same 
Discrete Random Variables | PDF | prob. density function | same 
Discrete Random Variables | X | X | the random variable X 
Discrete Random Variables | X ~ | the distribution of X | same 
Discrete Random Variables | ≥ | greater than or equal to | same 
Discrete Random Variables | ≤ | less than or equal to | same 
Discrete Random Variables | = | equal to | same 


Chapter (1st used) | Symbol | Spoken | Meaning 
Discrete Random Variables | ≠ | not equal to | same 
Continuous Random Variables | f(x) | f of x | function of x 
Continuous Random Variables | pdf | prob. density function | same 
Continuous Random Variables | U | uniform distribution | same 
Continuous Random Variables | Exp | exponential distribution | same 
Continuous Random Variables | f(x) = | f of x equals | same 
Continuous Random Variables | m | m | decay rate (for exp. dist.) 
The Normal Distribution | N | normal distribution | same 
The Normal Distribution | z | z-score | same 
The Normal Distribution | Z | standard normal dist. | same 
The Central Limit Theorem | X̄ | X-bar | the random variable X-bar 
The Central Limit Theorem | μ_x̄ | mean of X-bars | the average of X-bars 
The Central Limit Theorem | σ_x̄ | standard deviation of X-bars | same 
Confidence Intervals | CL | confidence level | same 
Confidence Intervals | CI | confidence interval | same 


Chapter (1st used) | Symbol | Spoken | Meaning 
Confidence Intervals | EBM | error bound for a mean | same 
Confidence Intervals | EBP | error bound for a proportion | same 
Confidence Intervals | t | Student's t-distribution | same 
Confidence Intervals | df | degrees of freedom | same 
Confidence Intervals | t_(α/2) | student t with α/2 area in right tail | same 
Confidence Intervals | p′ | p-prime | sample proportion of success 
Confidence Intervals | q′ | q-prime | sample proportion of failure 
Hypothesis Testing | H0 | H-naught, H-sub 0 | null hypothesis 
Hypothesis Testing | Ha | H-a, H-sub a | alternate hypothesis 
Hypothesis Testing | H1 | H-1, H-sub 1 | alternate hypothesis 
Hypothesis Testing | α | alpha | probability of Type I error 
Hypothesis Testing | β | beta | probability of Type II error 
Hypothesis Testing | X̄1 - X̄2 | X1-bar minus X2-bar | difference in sample means 
Hypothesis Testing | μ1 - μ2 | mu-1 minus mu-2 | difference in population means 
Hypothesis Testing | P′1 - P′2 | P1-prime minus P2-prime | difference in sample proportions 
Hypothesis Testing | p1 - p2 | p1 minus p2 | difference in population proportions 


Chapter (1st used) | Symbol | Spoken | Meaning 
Chi-Square Distribution | χ² | Ky-square | Chi-square 
Chi-Square Distribution | O | Observed | Observed frequency 
Chi-Square Distribution | E | Expected | Expected frequency 
Linear Regression and Correlation | y = a + bx | y equals a plus b-x | equation of a straight line 
Linear Regression and Correlation | ŷ | y-hat | estimated value of y 
Linear Regression and Correlation | r | sample correlation coefficient | same 
Linear Regression and Correlation | ε | error term for a regression line | same 
Linear Regression and Correlation | SSE | Sum of Squared Errors | same 
F-Distribution and ANOVA | F | F-ratio | F-ratio 


Symbols and their Meanings 


Formulas 


Symbols you must know 

Population | | Sample 
N | Size | n 
μ | Mean | x̄ 
σ² | Variance | s² 
σ | Standard deviation | s 
p | Proportion | p′ 


Single data set formulae 


Population 


b= EQ) =o) a) 


Q3 = Stet) ,Q1= {ett) 


N 2 
gS a yet (xy = 7) 
Single data set formulae 


Population 


p= E(x) = FH Dia (mi - fi) 


N 2 
eae a ,(m;— p)”- fi 


CV = 7-100 


Basic probability rules 


$P(A \cap B) = P(A|B) \cdot P(B)$ 


Standard deviation 


Proportion 


Arithmetic mean 


Geometric mean 
Inter-quartile range 
IQR =Q3- Q1 


Variance 


Arithmetic mean 


Geometric mean 


Variance 


Coefficient of 
variation 


$P(A \cup B) = P(A) + P(B) - P(A \cap B)$ 


$P(A \cap B) = P(A) \cdot P(B)$ or $P(A|B) = P(A)$ 


Q3 = Stet) Qi = fart) 


5? = = ee x)? 


Multiplication rule 
Addition rule 


Independence test 
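The three rules named above can be checked on a small example. The sketch below assumes a hypothetical experiment (one roll of a fair six-sided die) with two illustrative events; none of these names or values come from the text.

```python
from fractions import Fraction

# Hypothetical example: one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3


def prob(event: set) -> Fraction:
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))


# Conditional probability: restrict attention to the outcomes in B.
p_a_given_b = Fraction(len(A & B), len(B))

# Multiplication rule: P(A and B) = P(A|B) * P(B)
multiplication_rule_holds = prob(A & B) == p_a_given_b * prob(B)

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
addition_rule_holds = prob(A | B) == prob(A) + prob(B) - prob(A & B)

# Independence test: A and B are independent only if P(A and B) = P(A) * P(B)
independent = prob(A & B) == prob(A) * prob(B)

print(multiplication_rule_holds, addition_rule_holds, independent)
```

Exact fractions are used so that the equality checks are not affected by floating-point rounding. In this example the two events fail the independence test, since knowing the roll exceeds 3 changes the probability that it is even.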


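Hypergeometric distribution formulae 

Combinatorial equation: $\binom{n}{x} = {}_{n}C_{x} = \frac{n!}{x!(n-x)!}$ 
Probability equation: $P(x) = \frac{\binom{A}{x}\binom{N-A}{n-x}}{\binom{N}{n}}$ 
Mean: $E(X) = \mu = np$ 
Variance: $\sigma^2 = \left(\frac{N-n}{N-1}\right)np(q)$ 


Binomial distribution formulae 

Probability density function: $P(x) = \frac{n!}{x!(n-x)!}p^x q^{n-x}$ 
Arithmetic mean: $E(X) = \mu = np$ 
Variance: $\sigma^2 = np(q)$ 


Geometric distribution formulae 

When x is the first success: 
Probability: $P(X = x) = (1-p)^{x-1}(p)$ 
Mean: $\mu = \frac{1}{p}$ 
Variance: $\sigma^2 = \frac{1-p}{p^2}$ 

When x is the number of failures before first success: 
Probability: $P(X = x) = (1-p)^{x}(p)$ 
Mean: $\mu = \frac{1-p}{p}$ 
Variance: $\sigma^2 = \frac{1-p}{p^2}$ 


Poisson distribution formulae 

Probability equation: $P(x) = \frac{e^{-\mu}\mu^x}{x!}$ 
Mean: $E(X) = \mu$ 
Variance: $\sigma^2 = \mu$ 


Uniform distribution formulae 

PDF: $f(x) = \frac{1}{b-a}$ for $a \le x \le b$ 
Mean: $E(X) = \mu = \frac{a+b}{2}$ 
Variance: $\sigma^2 = \frac{(b-a)^2}{12}$ 


Exponential distribution formulae 

Cumulative probability: $P(X \le x) = 1 - e^{-mx}$ 
Mean and decay factor: $E(X) = \mu = \frac{1}{m}$ or $m = \frac{1}{\mu}$ 
Variance: $\sigma^2 = \frac{1}{m^2} = \mu^2$ 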
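Several of the distribution formulae above translate directly into code. The following is a minimal sketch using the standard library; the function names and the example parameters (n = 5, p = 0.5) are assumptions made for illustration. The geometric version shown is the one where x counts the trial on which the first success occurs.

```python
import math


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(x) = n! / (x!(n-x)!) * p^x * q^(n-x), with q = 1 - p."""
    q = 1 - p
    return math.comb(n, x) * p**x * q ** (n - x)


def poisson_pmf(x: int, mu: float) -> float:
    """P(x) = e^(-mu) * mu^x / x!"""
    return math.exp(-mu) * mu**x / math.factorial(x)


def geometric_pmf(x: int, p: float) -> float:
    """P(X = x) = (1-p)^(x-1) * p, where x is the trial of the first success."""
    return (1 - p) ** (x - 1) * p


def exponential_cdf(x: float, m: float) -> float:
    """P(X <= x) = 1 - e^(-m*x), where m is the decay rate."""
    return 1 - math.exp(-m * x)


# Check the binomial mean and variance formulas against the definition.
n, p = 5, 0.5
mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
variance = sum((x - mean) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1))
print(mean, n * p)
print(variance, n * p * (1 - p))
```

Summing over all possible values of x recovers the shortcut formulas for the binomial mean and variance, which is a useful check when the parameters are small.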


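The following page of formulae requires the use of the "Z", "t", "χ²" or "F" tables. 


Z-transformation for normal distribution: $Z = \frac{x - \mu}{\sigma}$ 
Normal approximation to the binomial: $Z = \frac{x - np'}{\sqrt{np'(q')}}$ 

Each entry below gives the statistic used for probability (ignoring subscripts) and hypothesis testing, followed by the matching confidence interval. Subscripts denote locations on the respective distribution tables; bracketed symbols equal the margin of error. 


Interval for the population mean when sigma is known 
$Z_c = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ 
$\bar{x} \pm \left[Z_{(\alpha/2)}\frac{\sigma}{\sqrt{n}}\right]$ 

Interval for the population mean when sigma is unknown but n > 30 
$Z_c = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ 
$\bar{x} \pm \left[Z_{(\alpha/2)}\frac{s}{\sqrt{n}}\right]$ 

Interval for the population mean when sigma is unknown but n < 30 
$t_c = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ 
$\bar{x} \pm \left[t_{(n-1),(\alpha/2)}\frac{s}{\sqrt{n}}\right]$ 

Interval for the population proportion 
$Z_c = \frac{p' - p_0}{\sqrt{\frac{p_0 q_0}{n}}}$ 
$p' \pm \left[Z_{(\alpha/2)}\sqrt{\frac{p'q'}{n}}\right]$ 

Interval for difference between two means with matched pairs 
$t_c = \frac{\bar{d} - \delta_0}{s_d/\sqrt{n}}$ 
$\bar{d} \pm \left[t_{(n-1),(\alpha/2)}\frac{s_d}{\sqrt{n}}\right]$ where $s_d$ is the standard deviation of the differences 

Interval for difference between two means when sigmas are known 
$Z_c = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$ 
$(\bar{x}_1 - \bar{x}_2) \pm \left[Z_{(\alpha/2)}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right]$ 

Interval for difference between two means with equal variances when sigmas are unknown 
$t_c = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$ 
$(\bar{x}_1 - \bar{x}_2) \pm \left[t_{df,(\alpha/2)}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\right]$ where 
$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2}$ 

Interval for difference between two population proportions 
$Z_c = \frac{(p'_1 - p'_2) - \delta_0}{\sqrt{\frac{p'_1(q'_1)}{n_1} + \frac{p'_2(q'_2)}{n_2}}}$ 
$(p'_1 - p'_2) \pm \left[Z_{(\alpha/2)}\sqrt{\frac{p'_1(q'_1)}{n_1} + \frac{p'_2(q'_2)}{n_2}}\right]$ 

$\chi^2_c = \frac{(n-1)s^2}{\sigma_0^2}$ 

Tests for GOF, Independence, and Homogeneity 
$\chi^2_c = \Sigma\frac{(O - E)^2}{E}$ where O = observed values and E = expected values 

$F_c = \frac{s_1^2}{s_2^2}$ where $s_1^2$ is the sample variance which is the larger of the two sample variances 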
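As an illustration of the first entry (population mean, sigma known), the sketch below computes a test statistic and a confidence interval. It assumes Python 3.8 or later for `statistics.NormalDist`; the function names and the sample values (sample mean 68, sigma 3, n = 36) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math
from statistics import NormalDist


def z_test_statistic(x_bar: float, mu_0: float, sigma: float, n: int) -> float:
    """Z_c = (x_bar - mu_0) / (sigma / sqrt(n))."""
    return (x_bar - mu_0) / (sigma / math.sqrt(n))


def mean_confidence_interval(x_bar: float, sigma: float, n: int,
                             confidence: float = 0.95) -> tuple:
    """x_bar +/- Z_(alpha/2) * sigma / sqrt(n), for a known population sigma."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value with alpha/2 in the right tail
    margin = z * sigma / math.sqrt(n)        # the bracketed margin of error
    return x_bar - margin, x_bar + margin


# Hypothetical sample: mean 68 from n = 36 observations, known sigma = 3.
low, high = mean_confidence_interval(68, 3, 36, confidence=0.90)
print(round(low, 4), round(high, 4))
print(z_test_statistic(68, 67, 3, 36))
```

Here the critical value comes from the inverse normal CDF rather than a printed Z table, so results may differ from table-based answers in the last decimal place.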


The next 3 formulae are for determining sample size with confidence intervals. 
(Note: E represents the margin of error.) 


$n = \frac{Z^2_{(\alpha/2)}\sigma^2}{E^2}$  Use when sigma is known. $E = \bar{x} - \mu$ 
$n = \frac{Z^2_{(\alpha/2)}(0.25)}{E^2}$  Use when $p'$ is unknown. $E = p' - p$ 
$n = \frac{Z^2_{(\alpha/2)}[p'(q')]}{E^2}$  Use when $p'$ is known. $E = p' - p$ 
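The sample-size formulae can be sketched as follows. Results are rounded up because a sample size must be a whole number at least as large as the computed value. The function names and example inputs are illustrative assumptions, and `statistics.NormalDist` requires Python 3.8 or later.

```python
import math
from statistics import NormalDist


def z_critical(confidence: float) -> float:
    """Z value with alpha/2 in the right tail, where alpha = 1 - confidence."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def sample_size_for_mean(sigma: float, error: float, confidence: float = 0.95) -> int:
    """n = Z^2 * sigma^2 / E^2, rounded up (use when sigma is known)."""
    z = z_critical(confidence)
    return math.ceil(z**2 * sigma**2 / error**2)


def sample_size_for_proportion(error: float, confidence: float = 0.95,
                               p_prime: float = 0.5) -> int:
    """n = Z^2 * p'q' / E^2, rounded up; p' = 0.5 gives the 0.25 worst case."""
    z = z_critical(confidence)
    return math.ceil(z**2 * p_prime * (1 - p_prime) / error**2)


# Hypothetical inputs: sigma = 15 with E = 2, and a proportion with E = 0.03.
print(sample_size_for_mean(15, 2))
print(sample_size_for_proportion(0.03))
```

Leaving `p_prime` at its default of 0.5 corresponds to the formula with 0.25 in the numerator, which gives the largest (most conservative) sample size.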


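Simple linear regression formulae for $y = a + b(x)$ 


Correlation coefficient: $r = \frac{\Sigma[(x-\bar{x})(y-\bar{y})]}{\sqrt{\Sigma(x-\bar{x})^2 \cdot \Sigma(y-\bar{y})^2}} = \frac{S_{xy}}{S_x S_y} = \sqrt{\frac{SSR}{SST}}$ 

Coefficient b (slope): $b = \frac{\Sigma[(x-\bar{x})(y-\bar{y})]}{\Sigma(x-\bar{x})^2} = \frac{S_{xy}}{SS_x} = r_{y,x}\left(\frac{s_y}{s_x}\right)$ 

y-intercept: $a = \bar{y} - b(\bar{x})$ 

Estimate of the error variance: $s_e^2 = \frac{\Sigma(y_i - \hat{y}_i)^2}{n-k} = \frac{\Sigma e_i^2}{n-k}$ 

Standard error for coefficient b: $S_b = \sqrt{\frac{s_e^2}{\Sigma(x_i-\bar{x})^2}} = \sqrt{\frac{s_e^2}{(n-1)s_x^2}}$ 

Hypothesis test for coefficient β: $t_c = \frac{b - \beta_0}{S_b}$ 

Interval for coefficient β: $b \pm \left[t_{n-2,\alpha/2} S_b\right]$ 

Interval for expected value of y: $\hat{y} \pm \left[t_{\alpha/2} \cdot s_e\sqrt{\frac{1}{n} + \frac{(x_p-\bar{x})^2}{\Sigma(x_i-\bar{x})^2}}\right]$ 

Prediction interval for an individual y: $\hat{y} \pm \left[t_{\alpha/2} \cdot s_e\sqrt{1 + \frac{1}{n} + \frac{(x_p-\bar{x})^2}{\Sigma(x_i-\bar{x})^2}}\right]$ 


ANOVA formulae 


Sum of squares regression: $SSR = \Sigma(\hat{y}_i - \bar{y})^2$ 

Sum of squares error: $SSE = \Sigma(\hat{y}_i - y_i)^2$ 

Sum of squares total: $SST = \Sigma(y_i - \bar{y})^2$ 

Coefficient of determination: $R^2 = \frac{SSR}{SST}$ 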


The following is the breakdown of a one-way ANOVA table for linear regression. 


Source of variation | Sum of squares | Degrees of freedom | Mean squares | F-ratio 
Regression | SSR | 1 or k - 1 | MSR = SSR/df_R | F = MSR/MSE 
Error | SSE | n - k | MSE = SSE/df_E | 
Total | SST | n - 1 | | 
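The quantities in the ANOVA table can be computed directly from paired data. The sketch below is a plain-Python illustration for simple linear regression (k = 2 estimated coefficients); the function name and the five data points are hypothetical and not taken from the text.

```python
def simple_linear_regression(xs, ys):
    """Least-squares line y = a + b*x with the ANOVA sums of squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Slope and intercept from the sums of cross-products and squares.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / ss_x
    a = y_bar - b * x_bar

    # Fitted values and the three sums of squares.
    y_hat = [a + b * x for x in xs]
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
    sst = sum((y - y_bar) ** 2 for y in ys)

    k = 2  # number of estimated coefficients (a and b)
    msr = ssr / (k - 1)
    mse = sse / (n - k)
    return {"a": a, "b": b, "SSR": ssr, "SSE": sse, "SST": sst,
            "R2": ssr / sst, "F": msr / mse}


# Hypothetical data set of five (x, y) pairs.
result = simple_linear_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in result.items():
    print(name, round(value, 4))
```

In the output, SSR and SSE add up to SST, and the F-ratio can be compared with the F critical value for 1 and n - 2 degrees of freedom.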


